Commit af2fe24

yeyu-nvidia and claude authored
Fix parquet loading crash from datasets version mismatch (#1140)
## Summary

- When local parquet files contain HF `datasets` metadata written by a different library version, `load_dataset("parquet")` raises a `TypeError` during feature deserialization
- Added a fallback that catches the `TypeError` and reads parquet files directly via PyArrow, bypassing the incompatible metadata

## Test plan

- [ ] Run `specdec_bench` with EAGLE config against local parquet dataset files
- [ ] Verify normal (compatible) parquet loading still works via the primary `load_dataset` path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved robustness of the parquet dataset loader by adding a safer fallback loading path and metadata handling to ensure reliable dataset reads across diverse environments.
* **Chores**
  * Broadened the supported version range for the datasets dependency to increase compatibility and reduce installation friction.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent bdc04f1 commit af2fe24

2 files changed

Lines changed: 22 additions & 2 deletions

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-datasets>=4.4.0,<5.0.0
+datasets>=3.1.0
 rich>=14.2.0
 seaborn>=0.13.2
 tiktoken>=0.12.0

examples/specdec_bench/specdec_bench/datasets/speed.py

Lines changed: 21 additions & 1 deletion
@@ -716,7 +716,27 @@ def _load_dataset(self, config_name_or_dataset_path: config_type | str) -> "Data
             }
         else:
             data_files = {"test": [str(config_name_or_dataset_path_path)]}
-        dataset = load_dataset("parquet", data_files=data_files, split="test")
+        try:
+            dataset = load_dataset("parquet", data_files=data_files, split="test")
+        except (TypeError, ValueError):
+            # Fallback: parquet metadata may be incompatible with the installed
+            # ``datasets`` version. Read via PyArrow and convert directly.
+            import pyarrow
+            import pyarrow.parquet as pq
+            from datasets import Dataset as HFDataset
+
+            tables = [pq.read_table(f) for f in data_files["test"]]
+            table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
+            # Strip HF metadata from the schema to avoid Feature parsing errors
+            schema = table.schema
+            if schema.metadata and b"huggingface" in schema.metadata:
+                new_meta = {
+                    k: v
+                    for k, v in schema.metadata.items()
+                    if k != b"huggingface"
+                }
+                table = table.replace_schema_metadata(new_meta or None)
+            dataset = HFDataset(table)
         if self.num_samples is not None:
             dataset = dataset.select(range(self.num_samples))
         return dataset
