Skip to content

Fix AttributeError in ArrowDataset on default split_ratio#528

Open
aman-source wants to merge 1 commit into
vllm-project:mainfrom
aman-source:fix/arrow-dataset-default-split-ratio
Open

Fix AttributeError in ArrowDataset on default split_ratio#528
aman-source wants to merge 1 commit into
vllm-project:mainfrom
aman-source:fix/arrow-dataset-default-split-ratio

Conversation

@aman-source
Copy link
Copy Markdown

Purpose

Fix a crash in ArrowDataset when constructed with the default split_ratio=1.0. _map_to_file_idx (and therefore __getitem__) reads self.start_file_idx unconditionally, but __init__ only sets that attribute inside the branches handling a non-trivial split. With split_ratio == 1.0, the attribute is never assigned and the first sample access raises AttributeError.

The bug is latent in current in-repo usage because scripts/train.py is the only caller and always passes split_ratio=0.9 or -0.1. It surfaces for any user constructing ArrowDataset directly with default arguments.

Description

Initialize self.start_file_idx = 0 once before the if/elif/else block in ArrowDataset.__init__. The existing assignments inside the non-trivial-split branches harmlessly override it, so behavior is unchanged on the current code paths.

Single-line fix in src/speculators/train/data.py.

Related Issue

None — bug found while reading the code.

Tests

Added test_arrow_dataset_default_split_ratio_does_not_crash in tests/unit/train/test_data.py. The test constructs a real ArrowDataset with the default split_ratio=1.0 against a small on-disk fixture and asserts _map_to_file_idx returns sensible values instead of raising.

Verified locally:

​```
pytest tests/unit/train/test_data.py -v

new test passes; existing tests still pass

ruff check src/speculators/train/data.py tests/unit/train/test_data.py
ruff format --check src/speculators/train/data.py tests/unit/train/test_data.py

all clean

​```

I have filled in:

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan/results, such as providing test command and pasting the results.
  • (Optional) The necessary documentation update.
  • I (a human) have written or reviewed the code in this pr to the best of my ability.

ArrowDataset.__init__ only set self.start_file_idx inside branches handling a non-trivial split. With split_ratio == 1.0 (the default), the attribute was never assigned and _map_to_file_idx raised AttributeError on the first access (e.g. ds[0]).

Set start_file_idx = 0 before the if/elif/else so all branches produce a usable dataset; the existing assignments in the split branches harmlessly override.

Signed-off-by: Aman Shaik <amanabdul21@gmail.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e9f0b94-2aee-4e10-a8a5-f3339b625214

📥 Commits

Reviewing files that changed from the base of the PR and between 4f80f5d and e1c0730.

📒 Files selected for processing (2)
  • src/speculators/train/data.py
  • tests/unit/train/test_data.py

Summary by CodeRabbit

  • Bug Fixes

    • Fixed a crash in ArrowDataset when initialized with default parameters.
  • Tests

    • Added unit test to validate ArrowDataset indexing works correctly with default configuration settings.

Walkthrough

ArrowDataset is patched to initialize start_file_idx=0 during construction, ensuring _map_to_file_idx() remains callable when split_ratio is 1.0 (the default). A regression test validates the fix by instantiating the dataset with default settings and confirming index mapping succeeds.

Changes

Default split_ratio initialization fix

Layer / File(s) Summary
Initialize start_file_idx in ArrowDataset
src/speculators/train/data.py, tests/unit/train/test_data.py
ArrowDataset.__init__ now sets self.start_file_idx = 0 immediately after dataset loading. Test imports are extended to include datasets.Dataset and ArrowDataset. A new regression test test_arrow_dataset_default_split_ratio_does_not_crash validates that ArrowDataset with split_ratio=1.0 and on_missing="skip" supports indexing via _map_to_file_idx() without raising AttributeError.

🎯 2 (Simple) | ⏱️ ~10 minutes


bug, training

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely describes the main fix: resolving an AttributeError in ArrowDataset when using the default split_ratio value.
Description check ✅ Passed The description is directly related to the changeset, providing clear context about the bug, the fix, test coverage, and verification results.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added bug Something isn't working training labels May 18, 2026
@fynnsu
Copy link
Copy Markdown
Collaborator

fynnsu commented May 20, 2026

Thanks for the fix!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working training

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants