Skip to content

Guard corpus reingest source reads to prevent unbounded ZIP-member memory use#2090

Open
JSv4 wants to merge 2 commits into
mainfrom
codex/fix-default-corpus-reingest-vulnerability
Open

Guard corpus reingest source reads to prevent unbounded ZIP-member memory use#2090
JSv4 wants to merge 2 commits into
mainfrom
codex/fix-default-corpus-reingest-vulnerability

Conversation

@JSv4

@JSv4 JSv4 commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Motivation

  • User-facing V2 corpus-export imports defaulted to the reingest path which performed an unbounded ZipExtFile.read() of each ZIP member, allowing a crafted upload to force Celery workers to allocate memory proportional to an uncompressed member and crash or hang workers.
  • The change introduces a lightweight guard so reingest is only attempted when the member's uncompressed size is safely bounded, falling back to the baked-import path otherwise to preserve availability.

Description

  • Add a new setting MAX_CORPUS_REINGEST_SOURCE_BYTES (default 256 MiB) to bound the uncompressed per-document member size that will be read for reingest. (file: config/settings/base.py)
  • Implement _read_reingest_source_bytes(import_zip, doc_filename) which checks ZipInfo.file_size against the configured limit, reads at most limit+1 bytes and returns None on oversize so callers can fall back to the baked import; log warnings when skipping reingest. (file: opencontractserver/tasks/import_tasks_v2.py)
  • Replace the previous eager fh.read() reingest read with the guarded helper and only take the reingest path when the helper returns bytes and _source_is_reingestable(...) is true, otherwise fall back to the baked import path. (file: opencontractserver/tasks/import_tasks_v2.py)
  • Add focused unit tests exercising oversized and under-limit ZIP-member behavior for the new helper, and import override_settings where needed. (file: opencontractserver/tests/test_import_v2_reingest_remap.py)

Testing

  • Ran python -m py_compile opencontractserver/tasks/import_tasks_v2.py config/settings/base.py and it succeeded with no syntax errors.
  • Attempted to run the new unit subset with pytest opencontractserver/tests/test_import_v2_reingest_remap.py::TestSourceReingestability -q but the environment lacks Django and pytest failed with ModuleNotFoundError: No module named 'django' so the full test run could not be executed here.
  • The change is limited to a guarded read and fallback logic to preserve existing functionality while avoiding unbounded memory allocation.

Codex Task

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 86.66667% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
opencontractserver/tasks/import_tasks_v2.py 86.66% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

The second guard in _read_reingest_source_bytes (lines 113-120 of
import_tasks_v2.py) fires when ZIP metadata under-reports file_size so
the first guard passes, but fh.read(max+1) still returns more than the
limit. Covers the logger.warning + return None that Codecov flagged as
uncovered.

The test creates a real 10-byte stored ZIP entry, patches the ZipInfo
file_size to 3 in-place (getinfo() returns the live NameToInfo entry so
the function under test sees the modified value), then asserts None is
returned.
@claude

claude Bot commented Jun 29, 2026

Copy link
Copy Markdown

Code Review

This PR addresses a real security concern: user-facing V2 corpus-export imports previously performed an unbounded ZipExtFile.read() on each document member, letting a crafted ZIP force Celery workers to allocate memory proportional to the uncompressed size. The fix adds a two-stage guard — a pre-read metadata check plus a post-read length check using a sentinel-byte read — and covers the logic with three focused unit tests.

The core guard mechanism is sound and the new setting is correctly wired through Django-environ. Two issues are worth addressing before merge.


opencontractserver/tasks/import_tasks_v2.py:363 — Misleading log message when the size guard fires

The compound condition if source_bytes is not None and _source_is_reingestable(source_bytes) has three possible outcomes, but the fallthrough path collapses two of them:

  • Case A: _read_reingest_source_bytes returns None — size guard rejected the member (inner logger.warning already fired)
  • Case B: _source_is_reingestable returns False — document is a genuine NUL placeholder

Both cases hit reingest_fallback = True and the same logger.info("... has no preserved source file (placeholder) ..."). An operator tailing at INFO level sees only "placeholder" for what was actually a size-limit rejection, making ZIP-bomb probing attempts invisible in normal logs.

Suggested fix: split the two branches to give each a distinct log message:

source_bytes = _read_reingest_source_bytes(import_zip, doc_filename)
if source_bytes is None:
    reingest_fallback = True
    logger.info(
        "Reingest: document %s skipped — source exceeds configured size limit; "
        "importing baked layer instead.",
        doc_filename,
    )
elif not _source_is_reingestable(source_bytes):
    reingest_fallback = True
    logger.info(
        "Reingest: document %s has no preserved source file (placeholder); "
        "importing baked layer instead.",
        doc_filename,
    )
else:
    return _reingest_document_with_deferred_remap(...)

opencontractserver/tests/test_import_v2_reingest_remap.py:141, 157, 185 — Repeated imports inside test method bodies (DRY / CLAUDE.md)

All three new test methods independently declare:

import io
import zipfile
from opencontractserver.tasks.import_tasks_v2 import _read_reingest_source_bytes

CLAUDE.md: "DRY — please always architect code for maximal dryness and always see if you can consolidate related code and remove duplicative code."

These belong at module level. The repeated BytesIO / ZipFile construction scaffold is also a candidate for a small shared helper on the class, which would make future edge-case tests shorter to write.


Minor observation — getattr fallback of 0

max_source_bytes = getattr(settings, "MAX_CORPUS_REINGEST_SOURCE_BYTES", 0)

0 is falsy, so if the setting were absent both if max_source_bytes and ... guards would be bypassed and fh.read(-1) (unbounded) would run. In practice this cannot happen because base.py always defines the setting with a 256 MiB default, so it is not an active risk. That said, 0 is the least safe possible fallback for a security guard; using settings.MAX_CORPUS_REINGEST_SOURCE_BYTES directly (no getattr) would loudly fail on misconfiguration rather than silently disabling the protection.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant