Gzip-compress Solr cache payloads; cap on compressed size (100 MB) by Robbie1977 · Pull Request #49 · VirtualFlyBrain/VFBquery

Robbie1977 · 2026-06-16T12:30:06Z

Large results (e.g. AllAlignedImages for whole-brain templates) serialised to hundreds of MB of JSON, exceeded the 10 MB cap, and were never cached - so they were recomputed on every call. Gzip+base64 the stored envelope (~10-15x smaller on the wire), enforce the size cap on the compressed payload, and raise the default cap to 100 MB (env: VFBQUERY_MAX_RESULT_MB). Reads transparently handle legacy plain-JSON and compressed entries; version bump invalidates stale ones.

Copilot

Pull request overview

This PR improves the Solr-backed result cache by storing cache envelopes as gz:-prefixed base64(gzip(JSON)) strings, enforcing the size cap on the stored compressed payload (defaulting to 100 MB via VFBQUERY_MAX_RESULT_MB), while maintaining backwards-compatible reads for legacy plain-JSON entries and bumping the package version to invalidate stale cache entries.

Changes:

Add gzip+base64 encoding for cache_data plus transparent decoding for legacy/plain entries.
Move size-cap enforcement to the compressed payload at write time; default cap raised to 100 MB and made env-configurable.
Add unit tests for encode/decode behavior and the new cap/metadata behavior; bump version to 1.22.0.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`src/vfbquery/solr_result_cache.py`	Adds gzip/base64 cache payload encoding/decoding and enforces size limits on the compressed stored value.
`tests/test_gzip_cache.py`	Adds unit tests for compression roundtrip, legacy decode handling, and updated size-cap/metadata behavior.
`src/vfbquery/_version.py`	Bumps version to invalidate older cached entries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    if isinstance(cached_field, str) and cached_field.startswith(_CACHE_GZIP_PREFIX):
+        blob = base64.b64decode(cached_field[len(_CACHE_GZIP_PREFIX):])
+        return gzip.decompress(blob).decode("utf-8")
+    return cached_field


+def test_cap_is_on_compressed_size():
+    c = SolrResultCache(max_result_size_mb=100)
+    assert c.max_result_size_bytes == 100 * 1024 * 1024
+    big = {"result": {"rows": [{"id": i, "name": "n"} for i in range(300000)]},
+           "cached_at": "2026-01-01T00:00:00+00:00",
+           "expires_at": "2026-04-01T00:00:00+00:00", "result_size": 0}
+    enc = _encode_cache_field(json.dumps(big))
+    assert len(enc.encode("utf-8")) < c.max_result_size_bytes


Clare72

Claude says it will fix the issue

Clare72 · 2026-06-16T13:14:02Z

btw, you don't need to manually update the version. It may make performance checks slow due to recomputes on version mismatch.

…nual version bump - _decode_cache_field: catch base64/gzip/unicode errors on corrupt gz: payloads and return the raw string, so callers treat it as invalid JSON (and purge it) instead of aborting cleanup/stats runs. (Copilot) - test: prove the cap is enforced on the COMPRESSED payload (raw > cap, compressed < cap) with a small cap and a highly compressible result. (Copilot) - revert manual _version.py bump (1.22.0 -> 1.21.0); the release workflow owns the version, and a manual mismatch forces cache recomputes. Legacy plain-JSON entries are still read transparently, so no invalidation is needed. (Clare)

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

            # Parse the cached metadata and result
-            cached_data = json.loads(cached_field)
+            cached_data = json.loads(_decode_cache_field(cached_field))



+            # Corrupt/truncated gz payload: return the raw string so callers'
+            # json.loads fails and the entry is treated as invalid (and purged),
+            # rather than raising an un-caught error that aborts cleanup/stats.
+            logger.warning("Failed to decode compressed cache payload; treating as invalid")


…racy - get_cached_result: explicitly catch decode/JSON errors from a corrupt or truncated gz: entry, purge it, and return None so it repopulates on the next call (was swallowed by the outer handler -> permanent miss). (Copilot) - _decode_cache_field: reword the comment so it no longer implies the function itself purges (the caller does), and fix "un-caught" -> "uncaught". (Copilot) - test: a corrupt gz: payload decodes to the raw string without raising.

Clare72 · 2026-06-16T13:37:17Z

Review: looks good — one CI gap to fix

Overall this is a clean, well-iterated fix for the design choice we're keeping (large results stay full-cardinality, just compressed). Nice work on the second commit's defensive decode.

Strengths

Compression approach is sound: gzip(6)+base64 with a gz: marker, and the cap is correctly enforced on the compressed stored_size in cache_result while _create_cache_metadata no longer rejects on raw size.
Backward-compatible: a JSON envelope always starts with {, so it can't collide with the gz: prefix — legacy plain-JSON entries decode unchanged and coexist during rollout.
Defensive decode: a corrupt/truncated gz: blob is caught and returned raw, so json.loads fails → entry treated as invalid/purged rather than crashing cleanup/stats.
cleanup_expired_entries reads expires_at from the Solr doc field first (if "expires_at" in doc) and only decodes cache_data as a fallback, so it doesn't decompress every entry just to check expiry.
Write timeout bumped 30→60s for the larger writes; manual version bump dropped (still 1.21.0 — let the release flow own it).

One real issue: the new tests won't run in CI

test_gzip_cache.py is in a new top-level tests/ dir, but this repo keeps tests in src/test/ and CI invokes specific files there (python-test.yml → src/test/term_info_queries_test.py; performance-test.yml → named src/test/* files). No workflow runs tests/ or a bare pytest, so these tests will never execute in CI.

Suggested fix: move to src/test/test_gzip_cache.py and wire it into a CI step — e.g. a pytest src/test -m "not integration" step (which would also start running test_graph_builder.py, currently unrun). Just relocating it isn't enough on its own; a step has to target it.

Minor nits (non-blocking)

The cache_result None-guard comment (# Result too large or other issue) is now stale — size no longer yields None from _create_cache_metadata.
Default cap is 100 MB compressed (~1–1.5 GB raw). Fine for the ~30 MB AllAlignedImages case, but a near-cap entry decompresses to ~1 GB in memory per read (OOM risk) and makes a ~100 MB Solr /update POST. A more conservative cap (30–50 MB) or a comment noting the ceiling would be safer.
The corrupt-gz: fallback path isn't unit-tested (one-liner to add).

Approve once the test is relocated + actually run in CI.

Clare72 · 2026-06-16T13:38:04Z

^ from claude

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Robbie1977 requested review from Clare72 and Copilot June 16, 2026 12:30

Copilot started reviewing on behalf of Robbie1977 June 16, 2026 12:30 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Clare72 approved these changes Jun 16, 2026

View reviewed changes

Robbie1977 requested a review from Copilot June 16, 2026 13:20

Copilot started reviewing on behalf of Robbie1977 June 16, 2026 13:21 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Robbie1977 requested a review from Copilot June 16, 2026 13:36

Copilot started reviewing on behalf of Robbie1977 June 16, 2026 13:37 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Comment thread src/vfbquery/solr_result_cache.py

Comment thread src/vfbquery/solr_result_cache.py Outdated

Robbie1977 and others added 2 commits June 16, 2026 14:40

Potential fix for pull request finding

0711ff3

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Potential fix for pull request finding

ac30b35

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Robbie1977 merged commit 47c6558 into main Jun 16, 2026
0 of 4 checks passed

Clare72 deleted the feature/gzip-result-cache branch June 16, 2026 13:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gzip-compress Solr cache payloads; cap on compressed size (100 MB)#49

Gzip-compress Solr cache payloads; cap on compressed size (100 MB)#49
Robbie1977 merged 5 commits into
mainfrom
feature/gzip-result-cache

Robbie1977 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Clare72 left a comment

Uh oh!

Clare72 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Clare72 commented Jun 16, 2026

Uh oh!

Clare72 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

Robbie1977 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Clare72 left a comment

Choose a reason for hiding this comment

Uh oh!

Clare72 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Clare72 commented Jun 16, 2026

Review: looks good — one CI gap to fix

Strengths

One real issue: the new tests won't run in CI

Minor nits (non-blocking)

Uh oh!

Clare72 commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants