polars.exceptions.OutOfBoundsError in ParquetLoader with `low_memory=True`

## 🐛 Bug

The default `low_memory=True` parameter of `ParquetLoader` causes, `polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N` in some cases.

This relates to #553 

Why (from a failing example):

index_entry.chunk_size = 793
parquet_num_rows = 793
-> file-level counts match perfectly.
Row groups are:

[128, 128, 128, 128, 128, 128, 14, 11]
cumulative ends: 128, 256, 384, 512, 640, 768, 782, 793
For row_index=782:

correct mapping is last row group (size 11), local index 0
litdata instead picked previous row group (size 14) and tried local index 14 -> OOB (0..13 only)
So index 14 appears because it computed: 782 - 768 = 14 on the wrong row group.

### To Reproduce

Steps to reproduce the behavior...



<details>
  <summary>Code sample</summary>

```python
"""Minimal reproducible example for litdata ParquetLoader OutOfBoundsError.
Bug location: litdata/streaming/item_loader.py (~line 724)
    num_rows_per_row_group = parquet_file.metadata.row_group(0).num_rows
    row_group_index = row_index // num_rows_per_row_group
    row_index_within_group = row_index % num_rows_per_row_group
The loader reads row_group(0).num_rows once and pretends every row group has
that size. When the parquet file contains row groups of unequal sizes where
row_group(0) is larger than a later row_group, for some row_index the computed
row_index_within_group exceeds the actual number of rows in that smaller
row_group, and Polars raises:
    polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N
"""
from __future__ import annotations
import shutil
import tempfile
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
import litdata
from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader
# Row group layout: row_group(0) is larger than the rest.
# With groups [100, 50, 50, 50, 50]:
#   total = 300 rows. For row_index = 150, loader computes
#       row_group_index = 150 // 100 = 1
#       row_index_within_group = 150 % 100 = 50
#   but row_group(1) has only 50 rows (max valid index 49) → crash.
FIRST_GROUP_ROWS = 100
OTHER_GROUP_ROWS = 50
NUM_OTHER_GROUPS = 4
def write_nonuniform_parquet(path: Path) -> None:
    """Write a parquet file where row_group(0) has more rows than the rest."""
    schema = pa.schema([("value", pa.int64())])
    with pq.ParquetWriter(path, schema) as writer:
        first = pa.table({"value": list(range(FIRST_GROUP_ROWS))}, schema=schema)
        writer.write_table(first, row_group_size=FIRST_GROUP_ROWS)
        for g in range(NUM_OTHER_GROUPS):
            start = FIRST_GROUP_ROWS + g * OTHER_GROUP_ROWS
            tbl = pa.table(
                {"value": list(range(start, start + OTHER_GROUP_ROWS))},
                schema=schema,
            )
            writer.write_table(tbl, row_group_size=OTHER_GROUP_ROWS)
def main() -> None:
    tmp = Path(tempfile.mkdtemp(prefix="litdata_bug_"))
    print(f"workdir: {tmp}")
    try:
        data_dir = tmp / "parquet_data"
        data_dir.mkdir()
        write_nonuniform_parquet(data_dir / "chunk_0000.parquet")
        pf = pq.ParquetFile(data_dir / "chunk_0000.parquet")
        sizes = [pf.metadata.row_group(i).num_rows for i in range(pf.num_row_groups)]
        print(f"row_group sizes: {sizes}  total rows: {sum(sizes)}")
        litdata.index_parquet_dataset(str(data_dir))
        dataset = StreamingDataset(
            input_dir=str(data_dir),
            cache_dir=str(tmp / "cache"),
            item_loader=ParquetLoader(low_memory=True),
            shuffle=False,
            drop_last=False,
        )
        print(f"dataset length: {len(dataset)}")
        for i, item in enumerate(dataset):
            if i % 25 == 0:
                print(f"  row {i}: {item}")
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
if __name__ == "__main__":
    main()

```

</details>

Alternatively, you can share a fully reproducible [Lightning Studio](https://lightning.ai/studios) environment:

> A simple guide on how to create such a studio can be found [here](https://www.youtube.com/watch?v=YcW-2Zt_bFg&ab_channel=LightningAI).

1. Create a [Studio](https://lightning.ai/studios).
2. Reproduce the issue in the Studio.
3. [Publish the Studio](https://lightning.ai/docs/overview/studios/publishing#how-to-publish).
4. Paste the Studio link here.

### Expected behavior



### Additional context



<details>
  <summary>Environment detail</summary>

- PyTorch Version (e.g., 1.0):
- OS (e.g., Linux):
- How you installed PyTorch (`conda`, `pip`, source):
- Build command you used (if compiling from source):
- Python version:
- CUDA/cuDNN version:
- GPU models and configuration:
- Any other relevant information:

</details>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

polars.exceptions.OutOfBoundsError in ParquetLoader with `low_memory=True` #809

🐛 Bug

To Reproduce

Expected behavior

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

polars.exceptions.OutOfBoundsError in ParquetLoader with low_memory=True #809

Description

🐛 Bug

To Reproduce

Expected behavior

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

polars.exceptions.OutOfBoundsError in ParquetLoader with `low_memory=True` #809