Skip to content

FEAT Add safe_extract_zip helper for remote dataset loaders#1957

Merged
romanlutz merged 4 commits into
microsoft:mainfrom
francose:feat/safe-zip-extract
Jun 30, 2026
Merged

FEAT Add safe_extract_zip helper for remote dataset loaders#1957
romanlutz merged 4 commits into
microsoft:mainfrom
francose:feat/safe-zip-extract

Conversation

@francose

@francose francose commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Closes #1879.

so the gap is that the 3 remote dataset loaders (figstep, vlguard, jailbreakv-28k) all call zipfile.ZipFile.extractall() straight on whatever upstream zip they downloaded. extractall doesn't check member paths or sizes or entry types, so if any of those upstream zips ever get tampered with you get classic Zip Slip — single entry named ../../etc/cron.d/x and you've landed a cron job on whoever ran the loader. Symlink entries do the same thing without even needing ...

Adds a new helper pyrit/common/safe_extract.py and swaps the 3 call sites. The helper validates every member's metadata BEFORE writing anything to disk — if any member fails a check, nothing gets written. matters because otherwise a malicious zip with 99 valid entries + 1 bad one would leak 99 files before raising.

6 checks, each kills a different attack class:

  • realpath stays under destination → Zip Slip (../etc/x)
  • reject symlink / block / char / fifo / socket entries → symlink path escape
  • per-file size cap (default 1 GiB) → single-entry decompression DoS
  • total uncompressed size cap (default 5 GiB) → multi-file zip bomb
  • compression ratio cap (default 100×) → ratio bomb
  • file count cap (default 50,000) → inode / hash-table DoS

all caps are kwargs so individual callers can tighten them if they want.

note remote_dataset_loader.py also has a zipfile.ZipFile() block but it uses zf.open(inner) to stream a known-name file into memory, never writes by member name. different code path, not a Zip Slip site, left it alone.

16 tests in tests/unit/common/test_safe_extract.py covering happy path, dotdot traversal, absolute paths (unix and drive-letter), symlink / block / fifo entries, per-file bomb, total-size bomb, compression ratio bomb, file-count cap, no-partial-write guarantee, both bytes and path sources, dest auto-create, resolved-path return. all green locally.

on scope — Ruslan suggested in the thread that we drop Py 3.10/3.11 and switch to tar (PEP 706 has the data filter built in), which honestly would be cleaner if we could, but Roman flagged 3.11 has plenty of runway left and the upstream datasets aren't ours to convert anyway. app-level defensive extraction is the only fix that covers all supported runtimes today, which is what this does.

@romanlutz romanlutz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic! Thanks @francose!

Comment thread pyrit/datasets/seed_datasets/remote/jailbreakv_28k_dataset.py Outdated
Comment thread pyrit/datasets/seed_datasets/remote/vlguard_dataset.py Outdated
Comment thread pyrit/common/safe_extract.py
@romanlutz romanlutz self-assigned this Jun 10, 2026
@francose francose force-pushed the feat/safe-zip-extract branch from bac6c33 to 1f1e792 Compare June 10, 2026 12:37
@francose francose marked this pull request as ready for review June 10, 2026 13:12
Copilot AI review requested due to automatic review settings June 10, 2026 13:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a defensive ZIP extraction utility and migrates remote dataset loaders to use it, mitigating Zip Slip and zip bomb risks from untrusted archives.

Changes:

  • Introduce pyrit.common.safe_extract.safe_extract_zip with validation for paths, entry types, size caps, and compression ratio.
  • Replace zipfile.ZipFile(...).extractall(...) usage in multiple remote dataset loaders with safe_extract_zip.
  • Add unit tests covering traversal, absolute paths, symlinks/devices/FIFOs, size/count limits, compression ratio bombs, and malformed headers.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/common/test_safe_extract.py Adds unit tests validating safe ZIP extraction behavior and failure modes.
pyrit/common/safe_extract.py Implements safe ZIP extraction with member validation and caps.
pyrit/datasets/seed_datasets/remote/vlguard_dataset.py Switches VLGuard ZIP extraction to safe_extract_zip.
pyrit/datasets/seed_datasets/remote/jailbreakv_28k_dataset.py Switches JailBreakV-28K ZIP extraction to safe_extract_zip.
pyrit/datasets/seed_datasets/remote/figstep_dataset.py Switches FigStep ZIP extraction to safe_extract_zip in async download flow.

Comment thread pyrit/common/safe_extract.py Outdated
Comment thread pyrit/common/safe_extract.py Outdated
Comment thread pyrit/common/safe_extract.py Outdated
Comment thread pyrit/common/safe_extract.py Outdated
Comment thread tests/unit/common/test_safe_extract.py
@francose

Copy link
Copy Markdown
Contributor Author

hey @romanlutz sorry to ping, pushed the updates addressing your feedback. lmk when you get a sec 🙏

Comment thread pyrit/common/safe_extract.py
Signed-off-by: francose <13445813+francose@users.noreply.github.com>
@francose francose force-pushed the feat/safe-zip-extract branch from 72f553a to a2e7b8c Compare June 27, 2026 13:16
Copilot AI added 3 commits June 30, 2026 13:06
The full JailBreakV_28K.zip is a ~15 GiB image bundle (compression ratio ~1x), which exceeds safe_extract_zip's 5 GiB default total-size cap. On a fresh extraction this raised UnsafeArchiveError and broke the loader. Forward a raised max_total_size (32 GiB) at the call site; the per-file size, compression-ratio, and file-count caps still bound a tampered archive.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@romanlutz romanlutz enabled auto-merge June 30, 2026 23:09
@romanlutz romanlutz added this pull request to the merge queue Jun 30, 2026
Merged via the queue into microsoft:main with commit 855a10f Jun 30, 2026
53 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement Secure ZIP Extraction for Remote Datasets

5 participants