FEAT Add safe_extract_zip helper for remote dataset loaders#1957
Merged
Conversation
romanlutz
reviewed
Jun 10, 2026
bac6c33 to
1f1e792
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a defensive ZIP extraction utility and migrates remote dataset loaders to use it, mitigating Zip Slip and zip bomb risks from untrusted archives.
Changes:
- Introduce
pyrit.common.safe_extract.safe_extract_zipwith validation for paths, entry types, size caps, and compression ratio. - Replace
zipfile.ZipFile(...).extractall(...)usage in multiple remote dataset loaders withsafe_extract_zip. - Add unit tests covering traversal, absolute paths, symlinks/devices/FIFOs, size/count limits, compression ratio bombs, and malformed headers.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/unit/common/test_safe_extract.py | Adds unit tests validating safe ZIP extraction behavior and failure modes. |
| pyrit/common/safe_extract.py | Implements safe ZIP extraction with member validation and caps. |
| pyrit/datasets/seed_datasets/remote/vlguard_dataset.py | Switches VLGuard ZIP extraction to safe_extract_zip. |
| pyrit/datasets/seed_datasets/remote/jailbreakv_28k_dataset.py | Switches JailBreakV-28K ZIP extraction to safe_extract_zip. |
| pyrit/datasets/seed_datasets/remote/figstep_dataset.py | Switches FigStep ZIP extraction to safe_extract_zip in async download flow. |
1f1e792 to
72f553a
Compare
Contributor
Author
|
hey @romanlutz sorry to ping, pushed the updates addressing your feedback. lmk when you get a sec 🙏 |
romanlutz
reviewed
Jun 27, 2026
Signed-off-by: francose <13445813+francose@users.noreply.github.com>
72f553a to
a2e7b8c
Compare
The full JailBreakV_28K.zip is a ~15 GiB image bundle (compression ratio ~1x), which exceeds safe_extract_zip's 5 GiB default total-size cap. On a fresh extraction this raised UnsafeArchiveError and broke the loader. Forward a raised max_total_size (32 GiB) at the call site; the per-file size, compression-ratio, and file-count caps still bound a tampered archive. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
romanlutz
approved these changes
Jun 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1879.
so the gap is that the 3 remote dataset loaders (figstep, vlguard, jailbreakv-28k) all call
zipfile.ZipFile.extractall()straight on whatever upstream zip they downloaded.extractalldoesn't check member paths or sizes or entry types, so if any of those upstream zips ever get tampered with you get classic Zip Slip — single entry named../../etc/cron.d/xand you've landed a cron job on whoever ran the loader. Symlink entries do the same thing without even needing...Adds a new helper
pyrit/common/safe_extract.pyand swaps the 3 call sites. The helper validates every member's metadata BEFORE writing anything to disk — if any member fails a check, nothing gets written. matters because otherwise a malicious zip with 99 valid entries + 1 bad one would leak 99 files before raising.6 checks, each kills a different attack class:
../etc/x)all caps are kwargs so individual callers can tighten them if they want.
note
remote_dataset_loader.pyalso has azipfile.ZipFile()block but it useszf.open(inner)to stream a known-name file into memory, never writes by member name. different code path, not a Zip Slip site, left it alone.16 tests in
tests/unit/common/test_safe_extract.pycovering happy path, dotdot traversal, absolute paths (unix and drive-letter), symlink / block / fifo entries, per-file bomb, total-size bomb, compression ratio bomb, file-count cap, no-partial-write guarantee, both bytes and path sources, dest auto-create, resolved-path return. all green locally.on scope — Ruslan suggested in the thread that we drop Py 3.10/3.11 and switch to tar (PEP 706 has the data filter built in), which honestly would be cleaner if we could, but Roman flagged 3.11 has plenty of runway left and the upstream datasets aren't ours to convert anyway. app-level defensive extraction is the only fix that covers all supported runtimes today, which is what this does.