🐛 Bug
With the default low_memory=True, ParquetLoader can raise polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N when the row groups of a parquet file are not all the same size.
This relates to #553.
Why (from a failing example):
index_entry.chunk_size = 793
parquet_num_rows = 793
-> file-level counts match perfectly.
Row groups are:
[128, 128, 128, 128, 128, 128, 14, 11]
cumulative ends: 128, 256, 384, 512, 640, 768, 782, 793
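For row_index=782:
the correct mapping is the last row group (index 7, size 11), local index 0
litdata instead picked the previous row group (index 6, size 14) and tried local index 14 -> OOB (valid indices are 0..13)
Index 14 appears because the loader assumes every row group has 128 rows: 782 // 128 = 6 and 782 % 128 = 14 (that is, 782 - 768 = 14), applied to the wrong row group.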
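The arithmetic in this example can be checked without litdata or Polars. The sketch below is illustrative only: `locate_uniform` mirrors the formula quoted from item_loader.py in the code sample, while `locate_cumulative` is a hypothetical helper (not part of litdata's API) that finds the row group from the cumulative row-group ends instead.

```python
from bisect import bisect_right
from itertools import accumulate


def locate_uniform(row_index: int, sizes: list[int]) -> tuple[int, int]:
    """Mirror of the reported formula: assume every group has sizes[0] rows."""
    first = sizes[0]
    return row_index // first, row_index % first


def locate_cumulative(row_index: int, sizes: list[int]) -> tuple[int, int]:
    """Hypothetical alternative: look the row up in the cumulative ends."""
    ends = list(accumulate(sizes))  # exclusive end offset of each row group
    if not 0 <= row_index < ends[-1]:
        raise IndexError(f"row_index {row_index} outside 0..{ends[-1] - 1}")
    group = bisect_right(ends, row_index)  # first group whose end > row_index
    start = ends[group - 1] if group else 0
    return group, row_index - start


sizes = [128, 128, 128, 128, 128, 128, 14, 11]
print(locate_uniform(782, sizes))     # (6, 14): group 6 has only 14 rows
print(locate_cumulative(782, sizes))  # (7, 0): first row of the last group
```

Using `bisect_right` on the cumulative ends keeps the lookup logarithmic in the number of row groups and also skips empty row groups, since repeated end offsets are stepped over.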
To Reproduce
Code sample
"""Minimal reproducible example for litdata ParquetLoader OutOfBoundsError.

Bug location: litdata/streaming/item_loader.py (~line 724)

    num_rows_per_row_group = parquet_file.metadata.row_group(0).num_rows
    row_group_index = row_index // num_rows_per_row_group
    row_index_within_group = row_index % num_rows_per_row_group

The loader reads row_group(0).num_rows once and pretends every row group has
that size. When the parquet file contains row groups of unequal sizes where
row_group(0) is larger than a later row_group, for some row_index the computed
row_index_within_group exceeds the actual number of rows in that smaller
row_group, and Polars raises:

    polars.exceptions.OutOfBoundsError: index N is out of bounds for sequence of length N
"""

from __future__ import annotations

import shutil
import tempfile
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

import litdata
from litdata import StreamingDataset
from litdata.streaming.item_loader import ParquetLoader

# Row group layout: row_group(0) is larger than the rest.
# With groups [100, 50, 50, 50, 50]:
#   total = 300 rows. For row_index = 150, loader computes
#     row_group_index        = 150 // 100 = 1
#     row_index_within_group = 150 %  100 = 50
#   but row_group(1) has only 50 rows (max valid index 49) → crash.
FIRST_GROUP_ROWS = 100
OTHER_GROUP_ROWS = 50
NUM_OTHER_GROUPS = 4


def write_nonuniform_parquet(path: Path) -> None:
    """Write a parquet file where row_group(0) has more rows than the rest."""
    schema = pa.schema([("value", pa.int64())])
    with pq.ParquetWriter(path, schema) as writer:
        first = pa.table({"value": list(range(FIRST_GROUP_ROWS))}, schema=schema)
        writer.write_table(first, row_group_size=FIRST_GROUP_ROWS)
        for g in range(NUM_OTHER_GROUPS):
            start = FIRST_GROUP_ROWS + g * OTHER_GROUP_ROWS
            tbl = pa.table(
                {"value": list(range(start, start + OTHER_GROUP_ROWS))},
                schema=schema,
            )
            writer.write_table(tbl, row_group_size=OTHER_GROUP_ROWS)


def main() -> None:
    tmp = Path(tempfile.mkdtemp(prefix="litdata_bug_"))
    print(f"workdir: {tmp}")
    try:
        data_dir = tmp / "parquet_data"
        data_dir.mkdir()
        write_nonuniform_parquet(data_dir / "chunk_0000.parquet")

        pf = pq.ParquetFile(data_dir / "chunk_0000.parquet")
        sizes = [pf.metadata.row_group(i).num_rows for i in range(pf.num_row_groups)]
        print(f"row_group sizes: {sizes} total rows: {sum(sizes)}")

        litdata.index_parquet_dataset(str(data_dir))
        dataset = StreamingDataset(
            input_dir=str(data_dir),
            cache_dir=str(tmp / "cache"),
            item_loader=ParquetLoader(low_memory=True),
            shuffle=False,
            drop_last=False,
        )
        print(f"dataset length: {len(dataset)}")
        for i, item in enumerate(dataset):
            if i % 25 == 0:
                print(f"  row {i}: {item}")
    finally:
        shutil.rmtree(tmp, ignore_errors=True)


if __name__ == "__main__":
    main()
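The row-group layout used by this script can also be audited on paper, without litdata installed. The sketch below applies only the formula quoted in the docstring to every row index of the [100, 50, 50, 50, 50] layout; `audit_uniform_mapping` is a hypothetical helper name, and whether the loader reaches the second category in practice is an assumption that depends on the rest of its implementation.

```python
from itertools import accumulate


def audit_uniform_mapping(sizes: list[int]) -> tuple[list[int], list[int]]:
    """Classify row indices under the 'every group has sizes[0] rows' formula.

    Returns (out_of_bounds, wrong_row):
      out_of_bounds: the computed local index does not exist in the group
      wrong_row:     the lookup succeeds but lands on a different global row
    """
    first = sizes[0]
    starts = [0, *accumulate(sizes)][:-1]  # global offset where each group begins
    out_of_bounds: list[int] = []
    wrong_row: list[int] = []
    for row_index in range(sum(sizes)):
        group, local = divmod(row_index, first)
        if group >= len(sizes) or local >= sizes[group]:
            out_of_bounds.append(row_index)
        elif starts[group] + local != row_index:
            wrong_row.append(row_index)
    return out_of_bounds, wrong_row


oob, wrong = audit_uniform_mapping([100, 50, 50, 50, 50])
print(f"first out-of-bounds index: {oob[0]}, count: {len(oob)}")
print(f"indices mapped to a different row: {wrong[0]}..{wrong[-1]}")
```

Under that formula alone, indices 150..199 and 250..299 are out of bounds, and indices 200..249 would resolve to rows 150..199 without raising, so the exception may not be the only symptom.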