Skip to content

Fix row index miscalculation in ParquetLoader#810

Open
vini-fda wants to merge 7 commits into
Lightning-AI:mainfrom
vini-fda:fix/parquet-loader-outofbounds
Open

Fix row index miscalculation in ParquetLoader#810
vini-fda wants to merge 7 commits into
Lightning-AI:mainfrom
vini-fda:fix/parquet-loader-outofbounds

Conversation

@vini-fda
Copy link
Copy Markdown
Contributor

@vini-fda vini-fda commented Apr 24, 2026

Before submitting
  • Was this discussed/agreed via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #809.

This replaces the uniform-size arithmetic with a per-chunk prefix-sum of row-group sizes, computed once when the ParquetFile is first opened, then use bisect.bisect_right(offsets, row_index) - 1 to locate the group and row_index - offsets[group] for the offset inside it.

The same num_rows_per_row_group value is also used in the cache-eviction check at

if read_count >= num_rows_per_row_group:
.

But that check needs the actual size of the current group (offsets[g+1] - offsets[g]), otherwise with uneven groups memory either leaks (never hits threshold) or is freed too early (forcing re-reads).

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

❌ Patch coverage is 89.47368% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (1fdfad7) to head (3bdc6b5).

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
  Coverage    81%    81%           
===================================
  Files        54     54           
  Lines      7617   7630   +13     
===================================
+ Hits       6143   6157   +14     
+ Misses     1474   1473    -1     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 26, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 90.47619% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (5213544) to head (ea2d8b8).
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #810   +/-   ##
===================================
- Coverage    81%    81%   -0%     
===================================
  Files        54     54           
  Lines      7617   7631   +14     
===================================
+ Hits       6144   6155   +11     
- Misses     1473   1476    +3     
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vini-fda vini-fda force-pushed the fix/parquet-loader-outofbounds branch from 5eddfe6 to bac1292 Compare May 26, 2026 16:29
@vini-fda
Copy link
Copy Markdown
Contributor Author

vini-fda commented May 26, 2026

@Borda I had to fix three unrelated issues that were causing CI problems:

  1. Example on macos-14, python 3.11: https://github.com/Lightning-AI/litData/actions/runs/26457907073/job/77897116939?pr=810

    Tests using clean_pq_index_cache all share ~/.lightning/chunks and the fixture shutil.rmtrees it on setup/teardown. Under pytest -n 2 two of them landed on different xdist workers and deleted each other's chunks mid-iteration. Fixed in bac1292 by auto-tagging those tests with xdist_group("hf_default_cache") and adding --dist=loadgroup to the CI invocation so they serialize onto one worker. The first attempt still scattered them across workers because my pytest_collection_modifyitems hook ran after xdist's own hook had already finalized nodeids; 8e3ffed adds @pytest.hookimpl(tryfirst=True) so the marker is in place before xdist appends the @group suffix that LoadGroupScheduling uses for routing.

  2. Example on windows-2022, python 3.11: https://github.com/Lightning-AI/litData/actions/runs/26457907073/job/77897116885?pr=810

    get_parquet_indexer_cls runs urlparse on the input path. On Windows a path like C:\Users\... parses with scheme='c', so it failed the scheme in ("local", "") check and raised ValueError. Fixed in bcb13c9 by treating any single-letter alphabetic scheme as a Windows drive letter and dispatching to LocalParquetDir.

  3. Example on ubuntu-22.04, python 3.11: optimize() writes intermediate chunks to tempfile.gettempdir() + "/chunks" (i.e. /tmp/chunks on Linux), configurable via DATA_OPTIMIZER_CACHE_FOLDER. Two streaming tests calling optimize() on different xdist workers both wrote to the same /tmp/chunks and the upload worker raced its sibling's cleanup, producing FileNotFoundError: '/tmp/chunks/chunk-0-0.bin'. Every optimize()-using test in tests/processing/test_data_processor.py already monkeypatches that env var to a per-test tmp dir, and the two streaming tests just didn't. Fixed in ea2d8b8 by adding the same monkeypatch.

@Borda
Copy link
Copy Markdown
Collaborator

Borda commented May 26, 2026

Borda I had to fix three unrelated issues that were causing CI problems

Thank you for reaching out, however you shall ping maintainers ⚡

@vinicius-freitas-nubank
Copy link
Copy Markdown

@tchaton @justusschock Any feedback here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

polars.exceptions.OutOfBoundsError in ParquetLoader with low_memory=True

4 participants