Skip to content

Gzip-compress Solr cache payloads; cap on compressed size (100 MB)#49

Merged
Robbie1977 merged 5 commits into
mainfrom
feature/gzip-result-cache
Jun 16, 2026
Merged

Gzip-compress Solr cache payloads; cap on compressed size (100 MB)#49
Robbie1977 merged 5 commits into
mainfrom
feature/gzip-result-cache

Conversation

@Robbie1977

Copy link
Copy Markdown
Contributor

Large results (e.g. AllAlignedImages for whole-brain templates) serialised to hundreds of MB of JSON, exceeded the 10 MB cap, and were never cached - so they were recomputed on every call. Gzip+base64 the stored envelope (~10-15x smaller on the wire), enforce the size cap on the compressed payload, and raise the default cap to 100 MB (env: VFBQUERY_MAX_RESULT_MB). Reads transparently handle legacy plain-JSON and compressed entries; version bump invalidates stale ones.

Large results (e.g. AllAlignedImages for whole-brain templates) serialised to
hundreds of MB of JSON, exceeded the 10 MB cap, and were never cached - so they
were recomputed on every call. Gzip+base64 the stored envelope (~10-15x smaller
on the wire), enforce the size cap on the compressed payload, and raise the
default cap to 100 MB (env: VFBQUERY_MAX_RESULT_MB). Reads transparently handle
legacy plain-JSON and compressed entries; version bump invalidates stale ones.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the Solr-backed result cache by storing cache envelopes as gz:-prefixed base64(gzip(JSON)) strings, enforcing the size cap on the stored compressed payload (defaulting to 100 MB via VFBQUERY_MAX_RESULT_MB), while maintaining backwards-compatible reads for legacy plain-JSON entries and bumping the package version to invalidate stale cache entries.

Changes:

  • Add gzip+base64 encoding for cache_data plus transparent decoding for legacy/plain entries.
  • Move size-cap enforcement to the compressed payload at write time; default cap raised to 100 MB and made env-configurable.
  • Add unit tests for encode/decode behavior and the new cap/metadata behavior; bump version to 1.22.0.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
src/vfbquery/solr_result_cache.py Adds gzip/base64 cache payload encoding/decoding and enforces size limits on the compressed stored value.
tests/test_gzip_cache.py Adds unit tests for compression roundtrip, legacy decode handling, and updated size-cap/metadata behavior.
src/vfbquery/_version.py Bumps version to invalidate older cached entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +66 to +69
if isinstance(cached_field, str) and cached_field.startswith(_CACHE_GZIP_PREFIX):
blob = base64.b64decode(cached_field[len(_CACHE_GZIP_PREFIX):])
return gzip.decompress(blob).decode("utf-8")
return cached_field
Comment thread tests/test_gzip_cache.py Outdated
Comment on lines +24 to +31
def test_cap_is_on_compressed_size():
c = SolrResultCache(max_result_size_mb=100)
assert c.max_result_size_bytes == 100 * 1024 * 1024
big = {"result": {"rows": [{"id": i, "name": "n"} for i in range(300000)]},
"cached_at": "2026-01-01T00:00:00+00:00",
"expires_at": "2026-04-01T00:00:00+00:00", "result_size": 0}
enc = _encode_cache_field(json.dumps(big))
assert len(enc.encode("utf-8")) < c.max_result_size_bytes

@Clare72 Clare72 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude says it will fix the issue

@Clare72

Clare72 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

btw, you don't need to manually update the version. It may make performance checks slow due to recomputes on version mismatch.

…nual version bump

- _decode_cache_field: catch base64/gzip/unicode errors on corrupt gz: payloads
  and return the raw string, so callers treat it as invalid JSON (and purge it)
  instead of aborting cleanup/stats runs. (Copilot)
- test: prove the cap is enforced on the COMPRESSED payload (raw > cap,
  compressed < cap) with a small cap and a highly compressible result. (Copilot)
- revert manual _version.py bump (1.22.0 -> 1.21.0); the release workflow owns the
  version, and a manual mismatch forces cache recomputes. Legacy plain-JSON
  entries are still read transparently, so no invalidation is needed. (Clare)

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/vfbquery/solr_result_cache.py Outdated
Comment on lines 312 to 314
# Parse the cached metadata and result
cached_data = json.loads(cached_field)
cached_data = json.loads(_decode_cache_field(cached_field))

Comment thread src/vfbquery/solr_result_cache.py Outdated
Comment on lines +71 to +74
# Corrupt/truncated gz payload: return the raw string so callers'
# json.loads fails and the entry is treated as invalid (and purged),
# rather than raising an un-caught error that aborts cleanup/stats.
logger.warning("Failed to decode compressed cache payload; treating as invalid")
…racy

- get_cached_result: explicitly catch decode/JSON errors from a corrupt or
  truncated gz: entry, purge it, and return None so it repopulates on the next
  call (was swallowed by the outer handler -> permanent miss). (Copilot)
- _decode_cache_field: reword the comment so it no longer implies the function
  itself purges (the caller does), and fix "un-caught" -> "uncaught". (Copilot)
- test: a corrupt gz: payload decodes to the raw string without raising.
@Robbie1977 Robbie1977 requested a review from Copilot June 16, 2026 13:36
@Clare72

Clare72 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review: looks good — one CI gap to fix

Overall this is a clean, well-iterated fix for the design choice we're keeping (large results stay full-cardinality, just compressed). Nice work on the second commit's defensive decode.

Strengths

  • Compression approach is sound: gzip(6)+base64 with a gz: marker, and the cap is correctly enforced on the compressed stored_size in cache_result while _create_cache_metadata no longer rejects on raw size.
  • Backward-compatible: a JSON envelope always starts with {, so it can't collide with the gz: prefix — legacy plain-JSON entries decode unchanged and coexist during rollout.
  • Defensive decode: a corrupt/truncated gz: blob is caught and returned raw, so json.loads fails → entry treated as invalid/purged rather than crashing cleanup/stats.
  • cleanup_expired_entries reads expires_at from the Solr doc field first (if "expires_at" in doc) and only decodes cache_data as a fallback, so it doesn't decompress every entry just to check expiry.
  • Write timeout bumped 30→60s for the larger writes; manual version bump dropped (still 1.21.0 — let the release flow own it).

One real issue: the new tests won't run in CI

test_gzip_cache.py is in a new top-level tests/ dir, but this repo keeps tests in src/test/ and CI invokes specific files there (python-test.ymlsrc/test/term_info_queries_test.py; performance-test.yml → named src/test/* files). No workflow runs tests/ or a bare pytest, so these tests will never execute in CI.

Suggested fix: move to src/test/test_gzip_cache.py and wire it into a CI step — e.g. a pytest src/test -m "not integration" step (which would also start running test_graph_builder.py, currently unrun). Just relocating it isn't enough on its own; a step has to target it.

Minor nits (non-blocking)

  • The cache_result None-guard comment (# Result too large or other issue) is now stale — size no longer yields None from _create_cache_metadata.
  • Default cap is 100 MB compressed (~1–1.5 GB raw). Fine for the ~30 MB AllAlignedImages case, but a near-cap entry decompresses to ~1 GB in memory per read (OOM risk) and makes a ~100 MB Solr /update POST. A more conservative cap (30–50 MB) or a comment noting the ceiling would be safer.
  • The corrupt-gz: fallback path isn't unit-tested (one-liner to add).

Approve once the test is relocated + actually run in CI.

@Clare72

Clare72 commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

^ from claude

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comment thread src/vfbquery/solr_result_cache.py
Comment thread src/vfbquery/solr_result_cache.py Outdated
Robbie1977 and others added 2 commits June 16, 2026 14:40
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@Robbie1977 Robbie1977 merged commit 47c6558 into main Jun 16, 2026
0 of 4 checks passed
@Clare72 Clare72 deleted the feature/gzip-result-cache branch June 16, 2026 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants