Guard corpus reingest source reads to prevent unbounded ZIP-member memory use #2090
JSv4 wants to merge 2 commits
The second guard in _read_reingest_source_bytes (lines 113-120 of import_tasks_v2.py) fires when ZIP metadata under-reports file_size so the first guard passes, but fh.read(max+1) still returns more than the limit. Covers the logger.warning + return None that Codecov flagged as uncovered. The test creates a real 10-byte stored ZIP entry, patches the ZipInfo file_size to 3 in-place (getinfo() returns the live NameToInfo entry so the function under test sees the modified value), then asserts None is returned.
Code Review
This PR addresses a real security concern: user-facing V2 corpus-export imports previously performed an unbounded read of each ZIP member. The core guard mechanism is sound and the new setting is correctly wired through django-environ. Two issues are worth addressing before merge.
Motivation
- User-facing V2 corpus-export imports performed an unbounded ZipExtFile.read() of each ZIP member, allowing a crafted upload to force Celery workers to allocate memory proportional to an uncompressed member and crash or hang.

Description
- Add MAX_CORPUS_REINGEST_SOURCE_BYTES (default 256 MiB) to bound the uncompressed per-document member size that will be read for reingest. (file: config/settings/base.py)
- Add _read_reingest_source_bytes(import_zip, doc_filename), which checks ZipInfo.file_size against the configured limit, reads at most limit+1 bytes, and returns None on oversize so callers can fall back to the baked import; a warning is logged whenever reingest is skipped. (file: opencontractserver/tasks/import_tasks_v2.py)
- Replace the unbounded fh.read() reingest read with the guarded helper, and take the reingest path only when the helper returns bytes and _source_is_reingestable(...) is true; otherwise fall back to the baked import path. (file: opencontractserver/tasks/import_tasks_v2.py)
- Update tests, using override_settings where needed. (file: opencontractserver/tests/test_import_v2_reingest_remap.py)

Testing
- Ran python -m py_compile opencontractserver/tasks/import_tasks_v2.py config/settings/base.py, which succeeded with no syntax errors.
- Attempted pytest opencontractserver/tests/test_import_v2_reingest_remap.py::TestSourceReingestability -q, but the environment lacks Django and pytest failed with ModuleNotFoundError: No module named 'django', so the full test run could not be executed here.
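
For readers who want to see the shape of the guard described in the Description, here is a minimal standalone sketch. It is not the code in this PR: read_member_bounded, choose_import_path, and make_demo_zip are hypothetical names, the limit is passed as an argument instead of being read from Django settings, and the is_reingestable callback only stands in for _source_is_reingestable(...), whose real signature is not shown here.

```python
import io
import logging
import zipfile
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def read_member_bounded(zf: zipfile.ZipFile, name: str, limit: int) -> Optional[bytes]:
    """Hypothetical stand-in for the guarded helper (not the PR's actual code)."""
    info = zf.getinfo(name)

    # Guard 1: reject members whose declared uncompressed size exceeds the limit.
    if info.file_size > limit:
        logger.warning(
            "Skipping reingest of %s: declared size %d exceeds limit %d",
            name, info.file_size, limit,
        )
        return None

    # Guard 2: request at most limit + 1 bytes, so an oversize member is
    # detected without ever holding more than limit + 1 bytes in memory.
    with zf.open(info) as fh:
        data = fh.read(limit + 1)
    if len(data) > limit:
        logger.warning("Skipping reingest of %s: read exceeded limit %d", name, limit)
        return None
    return data


def choose_import_path(
    zf: zipfile.ZipFile,
    name: str,
    limit: int,
    is_reingestable: Callable[[bytes], bool],
) -> str:
    """Take the reingest path only when bytes were returned and are reingestable."""
    data = read_member_bounded(zf, name, limit)
    if data is not None and is_reingestable(data):
        return "reingest"
    return "baked"  # fall back to the baked import path


def make_demo_zip(payload: bytes) -> zipfile.ZipFile:
    """Build a one-member, uncompressed ZIP in memory for experimentation."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("doc.pdf", payload)
    buf.seek(0)
    return zipfile.ZipFile(buf)
```

Returning None instead of raising keeps the decision with the caller, which matches the fallback-to-baked behavior described above: an oversize member degrades the import path without failing the whole task.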