Skip to content

Commit c4f29a5

Browse files
committed
Stop SOLR cache write storm: gate @with_solr_cache on env flag + back off on 5xx
Two related fixes for the cache-layer behaviour that's been flooding CI logs with "Failed to cache result: HTTP 500" and adding ~200-300ms per query while the SOLR vfb_json core has a broken Lucene index (`/var/solr/data/vfb_json/data/index/_3koa1.fdm: Input/output error` → `IndexWriter is closed`). 1) `@with_solr_cache` was firing unconditionally. The __init__.py guard only gated the second-layer `patch_vfbquery_with_caching()` patch, but the @with_solr_cache decorator is applied at module-import time to functions in vfb_queries.py (term_info, instances, templates, neurons_part_here, etc.) and was running its full cache write/read path even when VFBQUERY_CACHE_ENABLED=false. With caching disabled on a broken SOLR backend, every call still made the failing write attempt. Fix: respect the env flag inside the wrapper — if disabled, pop force_refresh (so the wrapped function doesn't see a stray kwarg it can't accept) and call straight through. 2) cache_result() backed off only on `Exception`. HTTP 5xx responses (Lucene IndexWriter closed, SOLR proxy 502/503) hit the `else:` branch which just logged and returned False — every subsequent call re-attempted the same write, hitting the same 5xx, costing 200-300ms each time and producing a multi-KB stack trace per call. Fix: treat any 5xx as cause to set _solr_disabled and start the same backoff window the exception path uses. 4xx still logs once but doesn't disable (likely a payload/config issue, not server-down). Net effect on the python-test workflow: - VFBQUERY_CACHE_ENABLED=false → no cache writes attempted at all - VFBQUERY_CACHE_ENABLED=true on a broken backend → one warning, then fast-fail for the backoff window Server-side: the broken Lucene segment on the SOLR vfb_json core is a separate sysadmin issue (filesystem I/O error on the SOLR host). This PR doesn't fix it, but it stops the failure from cascading into every test run. Refs: PR #41 CI logs showing the storm.
1 parent ef2b11b commit c4f29a5

1 file changed

Lines changed: 39 additions & 3 deletions

File tree

src/vfbquery/solr_result_cache.py

Lines changed: 39 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -391,9 +391,33 @@ def cache_result(self, query_type: str, term_id: str, result: Any, **params) ->
391391
logger.info(f"Cached {query_type} for {term_id} as {cache_doc_id}, size: {cached_data['result_size']/1024:.1f}KB")
392392
return True
393393
else:
394-
logger.error(f"Failed to cache result: HTTP {response.status_code} - {response.text}")
394+
# Server-side failure (typical examples: 500 Lucene IndexWriter
395+
# closed after a disk I/O error; 502/503 SOLR proxy down). The
396+
# original code only logged here and kept retrying every call,
397+
# which floods the log and adds ~200-300ms per query for the
398+
# round-trip to fail. Treat any 5xx as cause to trip the same
399+
# backoff machinery the exception branch uses, so subsequent
400+
# cache_result() calls fast-fail via _solr_available().
401+
err = f"HTTP {response.status_code} - {response.text[:500]}"
402+
if response.status_code >= 500:
403+
self._solr_disabled = True
404+
self._solr_disabled_until = time.time() + self._solr_backoff_seconds
405+
if err != self._solr_last_error:
406+
logger.warning(
407+
"Solr cache write returned %d; disabling cache for %ds: %s",
408+
response.status_code,
409+
self._solr_backoff_seconds,
410+
err,
411+
)
412+
self._solr_last_error = err
413+
else:
414+
# 4xx — probably a payload / config issue. Don't disable
415+
# but log once.
416+
if err != self._solr_last_error:
417+
logger.error("Failed to cache result: %s", err)
418+
self._solr_last_error = err
395419
return False
396-
420+
397421
except Exception as e:
398422
# Mark Solr as temporarily unavailable to avoid repeated errors
399423
self._solr_disabled = True
@@ -766,9 +790,21 @@ def get_term_info(short_form, force_refresh=False, **kwargs):
766790
"""
767791
def decorator(func):
768792
def wrapper(*args, **kwargs):
793+
# Honour VFBQUERY_CACHE_ENABLED=false by bypassing the cache layer
794+
# entirely. The __init__.py only gates the second-layer
795+
# patch_vfbquery_with_caching() patch, but this @with_solr_cache
796+
# decorator is applied at module-import time and so was firing
797+
# unconditionally before this guard — flooding the log with HTTP
798+
# 500s and adding hundreds of ms per call when the cache backend
799+
# is down. Pop force_refresh either way so the wrapped function
800+
# doesn't see a stray kwarg it doesn't accept.
801+
if os.getenv('VFBQUERY_CACHE_ENABLED', 'true').lower() in ('false', '0', 'no', 'off'):
802+
kwargs.pop('force_refresh', None)
803+
return func(*args, **kwargs)
804+
769805
# Check if force_refresh is requested (pop it before passing to function)
770806
force_refresh = kwargs.pop('force_refresh', False)
771-
807+
772808
# Check if limit is applied - only cache full results (limit=-1)
773809
limit = kwargs.get('limit', -1)
774810
should_cache = (limit == -1) # Only cache when getting all results (limit=-1)

0 commit comments

Comments
 (0)