Skip to content

polars.exceptions.OutOfBoundsError in ParquetLoader with low_memory=True #809

@tdfoust

Description

@tdfoust

🐛 Bug

The default low_memory=True parameter of ParquetLoader causes, polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N in some cases.

This relates to #553

Why (from a failing example):

index_entry.chunk_size = 793
parquet_num_rows = 793
-> file-level counts match perfectly.
Row groups are:

[128, 128, 128, 128, 128, 128, 14, 11]
cumulative ends: 128, 256, 384, 512, 640, 768, 782, 793
For row_index=782:

correct mapping is last row group (size 11), local index 0
litdata instead picked previous row group (size 14) and tried local index 14 -> OOB (0..13 only)
So index 14 appears because it computed: 782 - 768 = 14 on the wrong row group.

To Reproduce

Steps to reproduce the behavior...

Code sample
"""Minimal reproducible example for litdata ParquetLoader OutOfBoundsError.
Bug location: litdata/streaming/item_loader.py (~line 724)
    num_rows_per_row_group = parquet_file.metadata.row_group(0).num_rows
    row_group_index = row_index // num_rows_per_row_group
    row_index_within_group = row_index % num_rows_per_row_group
The loader reads row_group(0).num_rows once and pretends every row group has
that size. When the parquet file contains row groups of unequal sizes where
row_group(0) is larger than a later row_group, for some row_index the computed
row_index_within_group exceeds the actual number of rows in that smaller
row_group, and Polars raises:
    polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N
"""
from __future__ import annotations
import shutil
import tempfile
from pathlib import Path
import pyarrow as pa
import pyarrow.parquet as pq
import litdata
from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader
# Row group layout: row_group(0) is larger than the rest.
# With groups [100, 50, 50, 50, 50]:
#   total = 300 rows. For row_index = 150, loader computes
#       row_group_index = 150 // 100 = 1
#       row_index_within_group = 150 % 100 = 50
#   but row_group(1) has only 50 rows (max valid index 49) → crash.
FIRST_GROUP_ROWS = 100
OTHER_GROUP_ROWS = 50
NUM_OTHER_GROUPS = 4
def write_nonuniform_parquet(path: Path) -> None:
    """Write a parquet file where row_group(0) has more rows than the rest."""
    schema = pa.schema([("value", pa.int64())])
    with pq.ParquetWriter(path, schema) as writer:
        first = pa.table({"value": list(range(FIRST_GROUP_ROWS))}, schema=schema)
        writer.write_table(first, row_group_size=FIRST_GROUP_ROWS)
        for g in range(NUM_OTHER_GROUPS):
            start = FIRST_GROUP_ROWS + g * OTHER_GROUP_ROWS
            tbl = pa.table(
                {"value": list(range(start, start + OTHER_GROUP_ROWS))},
                schema=schema,
            )
            writer.write_table(tbl, row_group_size=OTHER_GROUP_ROWS)
def main() -> None:
    tmp = Path(tempfile.mkdtemp(prefix="litdata_bug_"))
    print(f"workdir: {tmp}")
    try:
        data_dir = tmp / "parquet_data"
        data_dir.mkdir()
        write_nonuniform_parquet(data_dir / "chunk_0000.parquet")
        pf = pq.ParquetFile(data_dir / "chunk_0000.parquet")
        sizes = [pf.metadata.row_group(i).num_rows for i in range(pf.num_row_groups)]
        print(f"row_group sizes: {sizes}  total rows: {sum(sizes)}")
        litdata.index_parquet_dataset(str(data_dir))
        dataset = StreamingDataset(
            input_dir=str(data_dir),
            cache_dir=str(tmp / "cache"),
            item_loader=ParquetLoader(low_memory=True),
            shuffle=False,
            drop_last=False,
        )
        print(f"dataset length: {len(dataset)}")
        for i, item in enumerate(dataset):
            if i % 25 == 0:
                print(f"  row {i}: {item}")
    finally:
        shutil.rmtree(tmp, ignore_errors=True)
if __name__ == "__main__":
    main()

Alternatively, you can share a fully reproducible Lightning Studio environment:

A simple guide on how to create such a studio can be found here.

  1. Create a Studio.
  2. Reproduce the issue in the Studio.
  3. Publish the Studio.
  4. Paste the Studio link here.

Expected behavior

Additional context

Environment detail
  • PyTorch Version (e.g., 1.0):
  • OS (e.g., Linux):
  • How you installed PyTorch (conda, pip, source):
  • Build command you used (if compiling from source):
  • Python version:
  • CUDA/cuDNN version:
  • GPU models and configuration:
  • Any other relevant information:

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't workinghelp wantedExtra attention is needed

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions