Commit 0d162d5

Optimize load() by avoiding redundant hash checks and unpacking (#1081)
### Description

Pooch by default revalidates file hashes and re-unpacks archives on every call, which is very slow for large checkpoints. This change introduces a `.checked` marker file that stores the resolved resource path once verification succeeds. Subsequent calls reuse this cached path instead of repeating the expensive validation and extraction steps.

Key changes:

- Use a `.checked` file alongside the cached resource to record the verified path.
- Load from the `.checked` file if it exists, bypassing re-validation.
- Ensure `.checked` is written after successful retrieval/unpacking.

### Type of changes

<!-- Mark the relevant option with an [x] -->

- [ ] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels:

- [SKIP_CI](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#skip_ci) - Skip all continuous integration tests
- [INCLUDE_NOTEBOOKS_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_notebooks_tests) - Execute notebook validation tests in pytest
- [INCLUDE_SLOW_TESTS](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/user-guide/contributing/contributing.md#include_slow_tests) - Execute tests labelled as slow in pytest for extensive testing

> [!NOTE]
> By default, the notebooks validation tests are skipped unless explicitly enabled.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

### Usage

<!--- How does a user interact with the changed code -->

```python
# TODO: Add code snippet
```

### Pre-submit Checklist

<!--- Ensure all items are completed before submitting -->

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

- New Features
  - Cache-based early exit to reuse previously verified and unpacked checkpoints, avoiding redundant downloads.
  - Automatic unpacking/decompression during retrieval based on file type.
- Performance
  - Faster subsequent loads by skipping repeated integrity checks and extraction on cache hits.
- Refactor
  - Unified post-retrieval path handling across flows; no public API changes.
- Chores
  - Added debug logs to indicate when cached paths are used for improved traceability.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Anton Vorontsov <avorontsov@nvidia.com>
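
The `.checked` marker pattern described in the commit message can be sketched in isolation. This is a minimal illustration under stated assumptions, not the bionemo implementation: `cached_fetch`, `expensive_fetch`, and `fake_fetch` are hypothetical names, and `expensive_fetch` stands in for the download, hash check, and unpacking that the commit delegates to `pooch.retrieve`.

```python
import tempfile
from pathlib import Path
from typing import Callable


def cached_fetch(cache_dir: Path, fname: str, expensive_fetch: Callable[[], Path]) -> Path:
    """Return the resolved path for `fname`, running `expensive_fetch` at most once.

    Hypothetical helper: `expensive_fetch` is a stand-in for the slow
    retrieve/verify/unpack step, not a real Pooch call.
    """
    checked = cache_dir / (fname + ".checked")
    if checked.exists():
        # Cache hit: reuse the path recorded by an earlier successful call.
        return Path(checked.read_text())

    path = expensive_fetch()
    # Write the marker only after the expensive step has succeeded.
    checked.write_text(str(path))
    return path


# Demonstration with a fake fetch that counts how often it runs.
cache_dir = Path(tempfile.mkdtemp())
calls = []


def fake_fetch() -> Path:
    calls.append(1)
    target = cache_dir / "abc123-model.ckpt"
    target.write_text("weights")
    return target


first = cached_fetch(cache_dir, "abc123-model.ckpt", fake_fetch)
second = cached_fetch(cache_dir, "abc123-model.ckpt", fake_fetch)
```

In this sketch the second call returns the same path as the first without invoking `fake_fetch` again, which is the behavior the commit aims for on repeated `load()` calls.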
1 parent a114094 commit 0d162d5

1 file changed

Lines changed: 16 additions & 4 deletions

sub-packages/bionemo-core/src/bionemo/core/data/load.py

@@ -195,9 +195,19 @@ def load(
     else:
         raise ValueError(f"Source '{source}' not supported.")
 
+    # Pooch will keep checking hashes and unpacking archives for each call,
+    # which is very time-consuming for large checkpoints. Instead, we make it
+    # do it only once by marking the resource as fully checked.
+    fname = f"{resource.sha256}-{filename}"
+    checked = cache_dir / (fname + ".checked")
+    if checked.exists():
+        path = checked.read_text()
+        logger.debug(f"Using cached {path=} from {checked=}")
+        return Path(path)
+
     download = pooch.retrieve(
         url=str(url),
-        fname=f"{resource.sha256}-{filename}",
+        fname=fname,
         known_hash=resource.sha256,
         path=cache_dir,
         downloader=download_fn,
@@ -207,10 +217,12 @@ def load(
     # Pooch by default returns a list of unpacked files if they unpack a zipped or tarred directory. Instead of that, we
     # just want the unpacked, parent folder.
     if isinstance(download, list):
-        return Path(processor.extract_dir)  # type: ignore
-
+        path = Path(processor.extract_dir)  # type: ignore
     else:
-        return Path(download)
+        path = Path(download)
+
+    checked.write_text(str(path))
+    return path
 
 
 def _get_processor(extension: str, unpack: bool | None, decompress: bool | None):
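
As written in the diff, the early-exit branch returns whatever path the `.checked` file records without confirming that the path still exists, so a marker that outlives its resource would yield a dangling path. One possible guard is sketched below. It is an assumption-laden illustration and not part of this commit: `read_checked_marker` is a hypothetical helper name, and dropping a stale marker is one design choice among several.

```python
import tempfile
from pathlib import Path
from typing import Optional


def read_checked_marker(checked: Path) -> Optional[Path]:
    """Return the recorded path if the marker and its target both exist.

    Hypothetical helper (not in the commit). Returns None when the caller
    should fall back to the full retrieve/verify/unpack step.
    """
    if not checked.exists():
        return None
    text = checked.read_text().strip()
    if text and Path(text).exists():
        return Path(text)
    # Stale or empty marker: remove it so the next call re-validates.
    checked.unlink()
    return None


# Demonstration: a valid marker is a hit; deleting the resource makes it a miss.
cache_dir = Path(tempfile.mkdtemp())
resource = cache_dir / "abc123-model.ckpt"
resource.write_text("weights")
checked = cache_dir / "abc123-model.ckpt.checked"
checked.write_text(str(resource))

hit = read_checked_marker(checked)
resource.unlink()
miss = read_checked_marker(checked)
```

The extra `exists()` call is cheap compared with rehashing a large checkpoint, so a guard of this kind would keep most of the speedup while tolerating a partially cleaned cache.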
