The train_test_split utility doesn't respect original order when shuffle=False #826

## 🐛 Bug
The `train_test_split` code in `litdata/utilities/train_test_split.py` appears to rely on the lexicographic ordering of chunk filenames, which does not match the order in which the chunks were written.
Stepping through the code, I noticed that at L91 the very first split contains these files:
`curr_chunk_filename=['chunk-0-0.bin', 'chunk-1-0.bin', 'chunk-10-0.bin', 'chunk-2-0.bin', 'chunk-3-0.bin', 'chunk-4-0.bin', 'chunk-5-0.bin', 'chunk-6-0.bin']`
Notice that chunk 10 ends up between chunks 1 and 2.
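
The ordering above can be reproduced without litdata. The following is a minimal sketch that assumes eleven hypothetical chunk files named like those in the list above and sorts them as plain strings. It does not show what `train_test_split` does internally, only that string sorting is enough to produce this order:

```python
# Hypothetical chunk names following the pattern seen in the list above.
filenames = [f"chunk-{i}-0.bin" for i in range(11)]

# Plain string sorting compares character by character,
# so "chunk-10-0.bin" sorts before "chunk-2-0.bin".
lexicographic = sorted(filenames)

# The first eight entries match the misordered list from the report.
print(lexicographic[:8])
```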

From testing, there seem to be two failure paths:
- more than 10 chunks are written per worker
- more than 10 workers are used

As soon as a chunk is named `chunk-10-x` or `chunk-x-10`, the filenames no longer sort in numeric order.
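
One possible workaround is to sort on the numeric fields of the filename instead of the raw string. The sketch below is not litdata's implementation or its eventual fix. `chunk_sort_key` is a hypothetical helper, and it assumes every chunk name contains exactly the two integer fields seen above. Whether ordering by the first field and then the second reproduces the original write order is also an assumption.

```python
import re


def chunk_sort_key(filename: str) -> tuple[int, ...]:
    """Hypothetical helper: extract the integer fields of a chunk filename."""
    return tuple(int(field) for field in re.findall(r"\d+", filename))


# Names that trigger both failure paths: a 10 in either numeric position.
names = ["chunk-0-10.bin", "chunk-0-2.bin", "chunk-10-0.bin", "chunk-2-0.bin"]

# Sorting on the parsed integers keeps 2 before 10 in both positions.
numeric_order = sorted(names, key=chunk_sort_key)
print(numeric_order)
```

Sorting on a tuple of integers compares the fields numerically from left to right, so adding digits to either field does not change the relative order.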

This is especially relevant when working with temporal data, where `shuffle=False` is typically used to keep the splits in chronological order.

