Fix parquet loading crash from datasets version mismatch (#1140)

yeyu-nvidia · claude · web-flow · commit af2fe2480e61 · 2026-04-07T20:55:55.000Z
## Summary

- When local parquet files contain HF `datasets` metadata written by a different library version, `load_dataset("parquet")` raises a `TypeError` during feature deserialization
- Added a fallback that catches the error (`TypeError` or `ValueError`) and reads the parquet files directly via PyArrow, stripping the `huggingface` schema metadata key to bypass the incompatible metadata

## Test plan

- [ ] Run `specdec_bench` with EAGLE config against local parquet dataset files
- [ ] Verify normal (compatible) parquet loading still works via the primary `load_dataset` path

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Improved robustness of the parquet dataset loader by adding a fallback loading path that reads files with PyArrow and drops incompatible HF metadata when `load_dataset` fails.
* **Chores**
  * Broadened the supported version range for the `datasets` dependency from `>=4.4.0,<5.0.0` to `>=3.1.0` to increase compatibility and reduce installation friction.

---------

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
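
The metadata-stripping step of the fallback can be sketched in isolation. This is a minimal illustration, not the code in `speed.py`: the helper names `drop_hf_metadata` and `strip_table` are hypothetical, and `strip_table` assumes its argument exposes the `pyarrow.Table` interface (`schema.metadata` and `replace_schema_metadata`).

```python
from typing import Optional

# Schema metadata key that HF `datasets` writes into parquet files (as seen in the diff).
HF_METADATA_KEY = b"huggingface"


def drop_hf_metadata(metadata: Optional[dict]) -> Optional[dict]:
    """Return a copy of the schema metadata without the HF key.

    Returns None when there is no metadata or nothing is left after
    filtering, mirroring the `new_meta or None` expression in the diff.
    """
    if not metadata:
        return None
    kept = {k: v for k, v in metadata.items() if k != HF_METADATA_KEY}
    return kept or None


def strip_table(table):
    """Hypothetical helper: remove HF metadata from a pyarrow.Table-like object.

    The table is returned unchanged when the HF key is absent.
    """
    metadata = table.schema.metadata
    if metadata and HF_METADATA_KEY in metadata:
        return table.replace_schema_metadata(drop_hf_metadata(metadata))
    return table
```

Other schema metadata (for example a `b"pandas"` entry) is preserved; only the `huggingface` key is dropped, so the installed `datasets` version would infer features from the Arrow schema instead of parsing the serialized ones.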
diff --git a/examples/specdec_bench/requirements_speed.txt b/examples/specdec_bench/requirements_speed.txt
@@ -1,4 +1,4 @@
-datasets>=4.4.0,<5.0.0
+datasets>=3.1.0
 rich>=14.2.0
 seaborn>=0.13.2
 tiktoken>=0.12.0
diff --git a/examples/specdec_bench/specdec_bench/datasets/speed.py b/examples/specdec_bench/specdec_bench/datasets/speed.py
@@ -716,7 +716,27 @@ def _load_dataset(self, config_name_or_dataset_path: config_type | str) -> "Data
                 }
             else:
                 data_files = {"test": [str(config_name_or_dataset_path_path)]}
-            dataset = load_dataset("parquet", data_files=data_files, split="test")
+            try:
+                dataset = load_dataset("parquet", data_files=data_files, split="test")
+            except (TypeError, ValueError):
+                # Fallback: parquet metadata may be incompatible with the installed
+                # ``datasets`` version.  Read via PyArrow and convert directly.
+                import pyarrow
+                import pyarrow.parquet as pq
+                from datasets import Dataset as HFDataset
+
+                tables = [pq.read_table(f) for f in data_files["test"]]
+                table = pyarrow.concat_tables(tables) if len(tables) > 1 else tables[0]
+                # Strip HF metadata from the schema to avoid Feature parsing errors
+                schema = table.schema
+                if schema.metadata and b"huggingface" in schema.metadata:
+                    new_meta = {
+                        k: v
+                        for k, v in schema.metadata.items()
+                        if k != b"huggingface"
+                    }
+                    table = table.replace_schema_metadata(new_meta or None)
+                dataset = HFDataset(table)
         if self.num_samples is not None:
             dataset = dataset.select(range(self.num_samples))
         return dataset