Skip to content

fix: download_all extracted data files to wrong path#231

Open
henryiii wants to merge 4 commits into
mainfrom
fix/download-all-extract-path
Open

fix: download_all extracted data files to wrong path#231
henryiii wants to merge 4 commits into
mainfrom
fix/download-all-extract-path

Conversation

@henryiii

Copy link
Copy Markdown
Member

🤖 AI text below 🤖

Fixes two download-path bugs tracked in #228.

Fix 1 — download_all() extracted to the wrong path (local_files.py)

download_all() used ZipFile.extract(member, path), whose second argument is the target directory: the member's full archive path is preserved beneath it. Data files therefore landed at paths like cache/foo.root/<repo-sha>/.../data/foo.root (a directory literally named foo.root) instead of the flat cache/foo.root that data_path() resolves. Each member is now written out flat with shutil.copyfileobj, and directory entries (names ending in /) are skipped so we never try to open a directory for writing.

Fix 2 — missing extractall safety filter (remote_files.py)

Added filter="data" to the tar.extractall(...) call in fetch_remote_dataset. Without it, Python <=3.13 emits a DeprecationWarning and falls back to unsafe extraction.

Test

Added a download_all regression test (the suite had none, which is why the bug shipped). It builds an in-memory zip mimicking the GitHub zipball layout, monkeypatches requests.get, and asserts the data file lands flat at tmp_path/somefile.root with the right bytes while the .py member and directory entry are ignored.

Full suite + prek -a (ruff/black/mypy strict) pass.

Refs #228

`download_all()` used `ZipFile.extract(member, path)`, which treats the
second argument as a target directory and preserves the member's full
archive path beneath it. Data files therefore landed at paths like
`cache/foo.root/<repo-sha>/.../data/foo.root` (a directory named
`foo.root`) instead of the flat `cache/foo.root` that `data_path()`
resolves. Members are now written out flat with `shutil.copyfileobj`,
and directory entries are skipped.

Also add the `filter="data"` safety filter to the `tar.extractall` call
in `fetch_remote_dataset`, which on Python <=3.13 otherwise emits a
DeprecationWarning and uses unsafe extraction.

A regression test for `download_all` is added (the suite previously had
none, which is why the bug shipped).

Refs #228

Assisted-by: ClaudeCode:claude-opus-4.8
@codecov

codecov Bot commented Jun 13, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 77.77778% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.34%. Comparing base (010b526) to head (0f49d05).

Files with missing lines Patch % Lines
src/skhep_testdata/remote_files.py 33.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #231      +/-   ##
==========================================
+ Coverage   73.43%   81.34%   +7.90%     
==========================================
  Files           4        4              
  Lines         128      134       +6     
==========================================
+ Hits           94      109      +15     
+ Misses         34       25       -9     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

logging.warning("Extracting %s", writefile)
with tarfile.open(str(writefile)) as tar:
members = [tar.getmember(f) for f in files.values()]
tar.extractall(str(dataset_dir), members)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't supported on all patch releases before 3.12.0.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 AI text below 🤖

Good catch. The data filter only exists on patch releases that got the backport (e.g. 3.10.12, not 3.10.0). Pushed df8b23d: the filtered call is now gated to sys.version_info >= (3, 12) or a hasattr(tarfile, "data_filter") feature check, falling back to the unfiltered extractall otherwise.

henryiii added 2 commits June 17, 2026 17:27
The "data" extraction filter is only available on Python patch releases
that received the backport (3.10.12, 3.11.4, ...), not earlier ones such
as 3.10.0. Gate the filtered call to 3.12+ or a feature check, falling
back to the unfiltered call otherwise.

Assisted-by: ClaudeCode:claude-opus-4.8
The else branch only runs on old Python patch releases without the data
filter backport, which CI does not exercise.

Assisted-by: ClaudeCode:claude-opus-4.8
@henryiii henryiii marked this pull request as ready for review June 17, 2026 21:33
@eduardo-rodrigues

Copy link
Copy Markdown
Member

@henryiii, I've just made a release as it was necessary for some work in pylhe. I see you have this and 2 other improvements in the pipeline. Feel free to finalise and merge, as I'm getting very low on time and need to get to some other work - Scikit-HEP has taken a lot of my week ;-).

I believe your 3 PRs can wait for new test files to make it to a new release, but you see what you think.

Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants