fix: download_all extracted data files to wrong path#231
Conversation
`download_all()` used `ZipFile.extract(member, path)`, which treats the second argument as a target directory and preserves the member's full archive path beneath it. Data files therefore landed at paths like `cache/foo.root/<repo-sha>/.../data/foo.root` (a directory named `foo.root`) instead of the flat `cache/foo.root` that `data_path()` resolves. Members are now written out flat with `shutil.copyfileobj`, and directory entries are skipped. Also add the `filter="data"` safety filter to the `tar.extractall` call in `fetch_remote_dataset`, which on Python <=3.13 otherwise emits a DeprecationWarning and uses unsafe extraction. A regression test for `download_all` is added (the suite previously had none, which is why the bug shipped). Refs #228 Assisted-by: ClaudeCode:claude-opus-4.8
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #231 +/- ##
==========================================
+ Coverage 73.43% 81.34% +7.90%
==========================================
Files 4 4
Lines 128 134 +6
==========================================
+ Hits 94 109 +15
+ Misses 34 25 -9 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
| logging.warning("Extracting %s", writefile) | ||
| with tarfile.open(str(writefile)) as tar: | ||
| members = [tar.getmember(f) for f in files.values()] | ||
| tar.extractall(str(dataset_dir), members) |
There was a problem hiding this comment.
This isn't supported on all patch releases before 3.12.0.
There was a problem hiding this comment.
🤖 AI text below 🤖
Good catch. The data filter only exists on patch releases that got the backport (e.g. 3.10.12, not 3.10.0). Pushed df8b23d: the filtered call is now gated to sys.version_info >= (3, 12) or a hasattr(tarfile, "data_filter") feature check, falling back to the unfiltered extractall otherwise.
The "data" extraction filter is only available on Python patch releases that received the backport (3.10.12, 3.11.4, ...), not earlier ones such as 3.10.0. Gate the filtered call to 3.12+ or a feature check, falling back to the unfiltered call otherwise. Assisted-by: ClaudeCode:claude-opus-4.8
The else branch only runs on old Python patch releases without the data filter backport, which CI does not exercise. Assisted-by: ClaudeCode:claude-opus-4.8
|
@henryiii, I've just made a release as it was necessary for some work in pylhe. I see you have this and 2 other improvements in the pipeline. Feel free to finalise and merge, as I'm getting very low on time and need to get to some other work - Scikit-HEP has taken a lot of my week ;-). I believe your 3 PRs can wait for new test files to make it to a new release, but you see what you think. Thank you. |
🤖 AI text below 🤖
Fixes two download-path bugs tracked in #228.
Fix 1 —
download_all()extracted to the wrong path (local_files.py)download_all()usedZipFile.extract(member, path), whose second argument is the target directory: the member's full archive path is preserved beneath it. Data files therefore landed at paths likecache/foo.root/<repo-sha>/.../data/foo.root(a directory literally namedfoo.root) instead of the flatcache/foo.rootthatdata_path()resolves. Each member is now written out flat withshutil.copyfileobj, and directory entries (names ending in/) are skipped so we never try to open a directory for writing.Fix 2 — missing
extractallsafety filter (remote_files.py)Added
filter="data"to thetar.extractall(...)call infetch_remote_dataset. Without it, Python <=3.13 emits aDeprecationWarningand falls back to unsafe extraction.Test
Added a
download_allregression test (the suite had none, which is why the bug shipped). It builds an in-memory zip mimicking the GitHub zipball layout, monkeypatchesrequests.get, and asserts the data file lands flat attmp_path/somefile.rootwith the right bytes while the.pymember and directory entry are ignored.Full suite +
prek -a(ruff/black/mypy strict) pass.Refs #228