Gzip-compress Solr cache payloads; cap on compressed size (100 MB) #49
Large results (e.g. AllAlignedImages for whole-brain templates) serialised to hundreds of MB of JSON, exceeded the 10 MB cap, and were never cached - so they were recomputed on every call. Gzip+base64 the stored envelope (~10-15x smaller on the wire), enforce the size cap on the compressed payload, and raise the default cap to 100 MB (env: VFBQUERY_MAX_RESULT_MB). Reads transparently handle legacy plain-JSON and compressed entries; version bump invalidates stale ones.
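A minimal sketch of how the cap described above could be read from the environment and applied to the stored (compressed) string. Only the variable name VFBQUERY_MAX_RESULT_MB and the 100 MB default come from this description; the helper names and the integer-only parsing are assumptions for illustration, not the PR's actual code.

```python
import os

DEFAULT_MAX_RESULT_MB = 100  # default stated in the PR description


def max_result_size_bytes(environ=None):
    """Return the cap in bytes, read from VFBQUERY_MAX_RESULT_MB if set.

    Hypothetical helper: falls back to the default when the variable is
    missing or is not a whole number of megabytes.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("VFBQUERY_MAX_RESULT_MB")
    try:
        mb = int(raw) if raw else DEFAULT_MAX_RESULT_MB
    except ValueError:
        mb = DEFAULT_MAX_RESULT_MB
    return mb * 1024 * 1024


def within_cap(stored_field, cap_bytes):
    """Check the size of the string that is actually stored, not the raw JSON."""
    return len(stored_field.encode("utf-8")) <= cap_bytes
```

Measuring the stored string means a result whose raw JSON exceeds the cap can still be cached when it compresses well.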
Pull request overview
This PR improves the Solr-backed result cache by storing cache envelopes as gz:-prefixed base64(gzip(JSON)) strings, enforcing the size cap on the stored compressed payload (defaulting to 100 MB via VFBQUERY_MAX_RESULT_MB), while maintaining backwards-compatible reads for legacy plain-JSON entries and bumping the package version to invalidate stale cache entries.
Changes:
- Add gzip+base64 encoding for cache_data plus transparent decoding for legacy/plain entries.
- Move size-cap enforcement to the compressed payload at write time; default cap raised to 100 MB and made env-configurable.
- Add unit tests for encode/decode behavior and the new cap/metadata behavior; bump version to 1.22.0.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/vfbquery/solr_result_cache.py | Adds gzip/base64 cache payload encoding/decoding and enforces size limits on the compressed stored value. |
| tests/test_gzip_cache.py | Adds unit tests for compression roundtrip, legacy decode handling, and updated size-cap/metadata behavior. |
| src/vfbquery/_version.py | Bumps version to invalidate older cached entries. |
    if isinstance(cached_field, str) and cached_field.startswith(_CACHE_GZIP_PREFIX):
        blob = base64.b64decode(cached_field[len(_CACHE_GZIP_PREFIX):])
        return gzip.decompress(blob).decode("utf-8")
    return cached_field
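For context, the round trip around the decode excerpt above can be sketched as follows. The decode side mirrors the quoted lines and the gz: prefix comes from the PR overview. The encode side is an assumption about what _encode_cache_field does, since its body is not shown in this conversation.

```python
import base64
import gzip

_CACHE_GZIP_PREFIX = "gz:"  # prefix named in the PR overview


def encode_cache_field(json_text):
    """Assumed encode step: gzip the JSON text, base64 it, add the prefix."""
    compressed = gzip.compress(json_text.encode("utf-8"))
    return _CACHE_GZIP_PREFIX + base64.b64encode(compressed).decode("ascii")


def decode_cache_field(cached_field):
    """Decode a gz:-prefixed entry; pass legacy plain-JSON entries through."""
    if isinstance(cached_field, str) and cached_field.startswith(_CACHE_GZIP_PREFIX):
        blob = base64.b64decode(cached_field[len(_CACHE_GZIP_PREFIX):])
        return gzip.decompress(blob).decode("utf-8")
    return cached_field
```

Valid JSON text cannot begin with the characters gz:, so the prefix is enough to tell compressed entries from legacy ones.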
    def test_cap_is_on_compressed_size():
        c = SolrResultCache(max_result_size_mb=100)
        assert c.max_result_size_bytes == 100 * 1024 * 1024
        big = {"result": {"rows": [{"id": i, "name": "n"} for i in range(300000)]},
               "cached_at": "2026-01-01T00:00:00+00:00",
               "expires_at": "2026-04-01T00:00:00+00:00", "result_size": 0}
        enc = _encode_cache_field(json.dumps(big))
        assert len(enc.encode("utf-8")) < c.max_result_size_bytes
Clare72 left a comment
Claude says it will fix the issue
btw, you don't need to manually update the version. It may make performance checks slow due to recomputes on version mismatch.
…nual version bump
- _decode_cache_field: catch base64/gzip/unicode errors on corrupt gz: payloads and return the raw string, so callers treat it as invalid JSON (and purge it) instead of aborting cleanup/stats runs. (Copilot)
- test: prove the cap is enforced on the COMPRESSED payload (raw > cap, compressed < cap) with a small cap and a highly compressible result. (Copilot)
- revert manual _version.py bump (1.22.0 -> 1.21.0); the release workflow owns the version, and a manual mismatch forces cache recomputes. Legacy plain-JSON entries are still read transparently, so no invalidation is needed. (Clare)
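The revised cap test described in this commit could look roughly like the sketch below. It is standalone rather than using SolrResultCache, and the encode helper, the 64 KB cap and the row count are illustrative assumptions, not values taken from the PR.

```python
import base64
import gzip
import json


def encode_cache_field(json_text):
    """Assumed stand-in for the PR's _encode_cache_field."""
    compressed = gzip.compress(json_text.encode("utf-8"))
    return "gz:" + base64.b64encode(compressed).decode("ascii")


def test_cap_applies_to_compressed_payload():
    cap_bytes = 64 * 1024  # deliberately small cap (assumption)
    # Highly compressible result: the same row repeated many times.
    raw = json.dumps({"result": {"rows": [{"id": 0, "name": "n"}] * 20000}})
    enc = encode_cache_field(raw)
    # The raw JSON would be rejected by the cap...
    assert len(raw.encode("utf-8")) > cap_bytes
    # ...but the stored, compressed form fits.
    assert len(enc.encode("utf-8")) < cap_bytes
```

Asserting both sides of the cap is what distinguishes this from the earlier test, which passed even if the cap had been checked against the raw size.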
    # Parse the cached metadata and result
    -cached_data = json.loads(cached_field)
    +cached_data = json.loads(_decode_cache_field(cached_field))
    # Corrupt/truncated gz payload: return the raw string so callers'
    # json.loads fails and the entry is treated as invalid (and purged),
    # rather than raising an un-caught error that aborts cleanup/stats.
    logger.warning("Failed to decode compressed cache payload; treating as invalid")
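A hedged sketch of the defensive decode this excerpt belongs to. The exception list is an assumption covering the standard-library errors that bad base64, a bad or truncated gzip stream, or invalid UTF-8 can raise; the PR's exact except clause is not shown here.

```python
import base64
import binascii
import gzip
import logging
import zlib

logger = logging.getLogger(__name__)
_CACHE_GZIP_PREFIX = "gz:"


def decode_cache_field(cached_field):
    """Decode a gz: entry, returning the raw string if the payload is corrupt."""
    if isinstance(cached_field, str) and cached_field.startswith(_CACHE_GZIP_PREFIX):
        try:
            blob = base64.b64decode(cached_field[len(_CACHE_GZIP_PREFIX):])
            return gzip.decompress(blob).decode("utf-8")
        except (binascii.Error, ValueError, OSError, EOFError, zlib.error):
            # binascii.Error: bad base64; OSError: not a gzip stream;
            # EOFError: truncated stream; zlib.error: corrupt deflate data;
            # ValueError also covers UnicodeDecodeError.
            logger.warning("Failed to decode compressed cache payload; treating as invalid")
            return cached_field
    return cached_field
```

Returning the raw gz: string means the caller's json.loads fails on it, so the entry goes down the existing invalid-JSON path.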
…racy
- get_cached_result: explicitly catch decode/JSON errors from a corrupt or truncated gz: entry, purge it, and return None so it repopulates on the next call (was swallowed by the outer handler -> permanent miss). (Copilot)
- _decode_cache_field: reword the comment so it no longer implies the function itself purges (the caller does), and fix "un-caught" -> "uncaught". (Copilot)
- test: a corrupt gz: payload decodes to the raw string without raising.
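The purge-and-miss behaviour described for get_cached_result can be sketched as below. The function name and the decode and purge callables are hypothetical stand-ins; the real method talks to Solr and its signature is not shown in this conversation.

```python
import json


def load_cached_envelope(cached_field, decode, purge):
    """Return the parsed envelope, or None after purging a corrupt entry.

    decode and purge are hypothetical callables standing in for the
    cache's own decode helper and its delete-from-Solr step.
    """
    try:
        return json.loads(decode(cached_field))
    except (ValueError, TypeError):
        # json.JSONDecodeError is a ValueError; a corrupt gz: payload that
        # decodes to its raw string ends up here.
        purge()
        return None
```

Because the bad entry is deleted rather than just skipped, the next call recomputes and stores a fresh result instead of missing permanently.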
Review: looks good, one CI gap to fix
Overall this is a clean, well-iterated fix for the design choice we're keeping (large results stay full-cardinality, just compressed). Nice work on the second commit's defensive decode.
Strengths
One real issue: the new tests won't run in CI
Suggested fix: move to
Minor nits (non-blocking)
Approve once the test is relocated + actually run in CI.
^ from claude
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>