Fix AttributeError in ArrowDataset on default split_ratio#528
Fix AttributeError in ArrowDataset on default split_ratio#528aman-source wants to merge 1 commit into
Conversation
ArrowDataset.__init__ only set self.start_file_idx inside branches handling a non-trivial split. With split_ratio == 1.0 (the default), the attribute was never assigned and _map_to_file_idx raised AttributeError on the first access (e.g. ds[0]). Set start_file_idx = 0 before the if/elif/else so all branches produce a usable dataset; the existing assignments in the split branches harmlessly override. Signed-off-by: Aman Shaik <amanabdul21@gmail.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
Summary by CodeRabbit
Walkthrough
ChangesDefault split_ratio initialization fix
🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Thanks for the fix! |
Purpose
Fix a crash in
ArrowDatasetwhen constructed with the defaultsplit_ratio=1.0._map_to_file_idx(and therefore__getitem__) readsself.start_file_idxunconditionally, but__init__only sets that attribute inside the branches handling a non-trivial split. Withsplit_ratio == 1.0, the attribute is never assigned and the first sample access raisesAttributeError.The bug is latent in current in-repo usage because
scripts/train.pyis the only caller and always passessplit_ratio=0.9or-0.1. It surfaces for any user constructingArrowDatasetdirectly with default arguments.Description
Initialize
self.start_file_idx = 0once before theif/elif/elseblock inArrowDataset.__init__. The existing assignments inside the non-trivial-split branches harmlessly override it, so behavior is unchanged on the current code paths.Single-line fix in
src/speculators/train/data.py.Related Issue
None — bug found while reading the code.
Tests
Added
test_arrow_dataset_default_split_ratio_does_not_crashintests/unit/train/test_data.py. The test constructs a realArrowDatasetwith the defaultsplit_ratio=1.0against a small on-disk fixture and asserts_map_to_file_idxreturns sensible values instead of raising.Verified locally:
```
pytest tests/unit/train/test_data.py -v
new test passes; existing tests still pass
ruff check src/speculators/train/data.py tests/unit/train/test_data.py
ruff format --check src/speculators/train/data.py tests/unit/train/test_data.py
all clean
```
I have filled in: