Commit 289a239
fix: use data_dir for directory paths in ShardedDataset (#1301)
## Summary
- `datasets`' `resolve_pattern` only matches entries with
`type=="file"`, so passing a bare directory path as `data_files` to
`load_dataset` results in `FileNotFoundError` even when the directory
exists on disk
- Detect directory paths in `ShardedDataset._load_dataset()` and pass
them via `data_dir` instead of `data_files`
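
The routing described above can be sketched as a small standalone helper. This is a minimal illustration and not the code from this commit: the name `build_load_kwargs` is hypothetical, and returning an empty dict for `None` is an assumption about how the no-argument case is handled.

```python
import os
from typing import Any, Dict, List, Optional, Union

DataFiles = Optional[Union[str, List[str]]]


def build_load_kwargs(data_files: DataFiles) -> Dict[str, Any]:
    """Hypothetical helper: pick the keyword argument for load_dataset.

    A path that is an existing directory is routed to data_dir; anything
    else (a file path, a glob pattern, a list of paths) stays in data_files.
    """
    if data_files is None:
        # Assumption: with no data_files, pass neither argument.
        return {}
    if isinstance(data_files, str) and os.path.isdir(data_files):
        return {"data_dir": data_files}
    return {"data_files": data_files}
```

A caller would then forward the result, for example `load_dataset("json", **build_load_kwargs(path))`. Checking `os.path.isdir` only for plain strings keeps glob patterns and lists of files on the existing `data_files` path.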
## Reproduction
```python
from datasets import load_dataset
# This fails with FileNotFoundError:
load_dataset("json", data_files="/path/to/data_directory")
# This works:
load_dataset("json", data_dir="/path/to/data_directory")
```
## Test plan
- [ ] Verify existing EAGLE3/DFlash training pipelines that pass
directory paths work
- [ ] Verify file paths and glob patterns still work (they fall through to
`data_files`)
- [ ] Verify `data_files=None` (no data_files arg) still works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
## Bug Fixes
* Fixed an issue with dataset loading that prevented proper handling of
directory-based data sources. Directories are now correctly detected and
processed during dataset initialization.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 97d1531 commit 289a239
1 file changed
Lines changed: 11 additions & 1 deletion
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| 73-75 | 73-75 | unchanged context |
| | 76-84 | added (+) |
| 76-78 | 85-87 | unchanged context |
| 79 | | removed (-) |
| | 88-89 | added (+) |
| 80-82 | 90-92 | unchanged context |