Fix AttributeError in ArrowDataset on default split_ratio by aman-source · Pull Request #528 · vllm-project/speculators

aman-source · 2026-05-18T06:09:55Z

Purpose

Fix a crash in ArrowDataset when constructed with the default split_ratio=1.0. _map_to_file_idx (and therefore __getitem__) reads self.start_file_idx unconditionally, but __init__ only sets that attribute inside the branches handling a non-trivial split. With split_ratio == 1.0, the attribute is never assigned and the first sample access raises AttributeError.

The bug is latent in current in-repo usage because scripts/train.py is the only caller and always passes split_ratio=0.9 or -0.1. It surfaces for any user constructing ArrowDataset directly with default arguments.

Description

Initialize self.start_file_idx = 0 once before the if/elif/else block in ArrowDataset.__init__. The existing assignments inside the non-trivial-split branches harmlessly override it, so behavior is unchanged on the current code paths.

Single-line fix in src/speculators/train/data.py.

Related Issue

None — bug found while reading the code.

Tests

Added test_arrow_dataset_default_split_ratio_does_not_crash in tests/unit/train/test_data.py. The test constructs a real ArrowDataset with the default split_ratio=1.0 against a small on-disk fixture and asserts _map_to_file_idx returns sensible values instead of raising.

Verified locally:

```
pytest tests/unit/train/test_data.py -v

new test passes; existing tests still pass

ruff check src/speculators/train/data.py tests/unit/train/test_data.py
ruff format --check src/speculators/train/data.py tests/unit/train/test_data.py

all clean

```

I have filled in:

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan/results, such as providing test command and pasting the results.
(Optional) The necessary documentation update.
I (a human) have written or reviewed the code in this pr to the best of my ability.

ArrowDataset.__init__ only set self.start_file_idx inside branches handling a non-trivial split. With split_ratio == 1.0 (the default), the attribute was never assigned and _map_to_file_idx raised AttributeError on the first access (e.g. ds[0]). Set start_file_idx = 0 before the if/elif/else so all branches produce a usable dataset; the existing assignments in the split branches harmlessly override. Signed-off-by: Aman Shaik <amanabdul21@gmail.com>

coderabbitai · 2026-05-18T06:10:07Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6e9f0b94-2aee-4e10-a8a5-f3339b625214

📥 Commits

Reviewing files that changed from the base of the PR and between 4f80f5d and e1c0730.

📒 Files selected for processing (2)

src/speculators/train/data.py
tests/unit/train/test_data.py

Summary by CodeRabbit

Bug Fixes
- Fixed a crash in ArrowDataset when initialized with default parameters.
Tests
- Added unit test to validate ArrowDataset indexing works correctly with default configuration settings.

Walkthrough

ArrowDataset is patched to initialize start_file_idx=0 during construction, ensuring _map_to_file_idx() remains callable when split_ratio is 1.0 (the default). A regression test validates the fix by instantiating the dataset with default settings and confirming index mapping succeeds.

Changes

Default split_ratio initialization fix

Layer / File(s)	Summary
Initialize start_file_idx in ArrowDataset `src/speculators/train/data.py`, `tests/unit/train/test_data.py`	`ArrowDataset.__init__` now sets `self.start_file_idx = 0` immediately after dataset loading. Test imports are extended to include `datasets.Dataset` and `ArrowDataset`. A new regression test `test_arrow_dataset_default_split_ratio_does_not_crash` validates that `ArrowDataset` with `split_ratio=1.0` and `on_missing="skip"` supports indexing via `_map_to_file_idx()` without raising `AttributeError`.

🎯 2 (Simple) | ⏱️ ~10 minutes

bug, training

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely describes the main fix: resolving an AttributeError in ArrowDataset when using the default split_ratio value.
Description check	✅ Passed	The description is directly related to the changeset, providing clear context about the bug, the fix, test coverage, and verification results.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

fynnsu · 2026-05-20T17:59:06Z

Thanks for the fix!

coderabbitai Bot added bug Something isn't working training labels May 18, 2026

fynnsu approved these changes May 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix AttributeError in ArrowDataset on default split_ratio#528

Fix AttributeError in ArrowDataset on default split_ratio#528
aman-source wants to merge 1 commit into
vllm-project:mainfrom
aman-source:fix/arrow-dataset-default-split-ratio

aman-source commented May 18, 2026

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Uh oh!

fynnsu commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aman-source commented May 18, 2026

Purpose

Description

Related Issue

Tests

new test passes; existing tests still pass

all clean

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Walkthrough

Changes

Uh oh!

fynnsu commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 18, 2026 •

edited

Loading