Stop SOLR cache write storm: gate @with_solr_cache on env flag + back off on 5xx

Robbie1977 · Robbie1977 · commit c4f29a5935db · 2026-05-27T06:03:53.000Z
Two related fixes for the cache-layer behaviour that's been flooding CI logs with "Failed to cache result: HTTP 500" and adding ~200-300ms per query while the SOLR vfb_json core has a broken Lucene index (`/var/solr/data/vfb_json/data/index/_3koa1.fdm: Input/output error` → `IndexWriter is closed`).

1) `@with_solr_cache` was firing unconditionally.
   The __init__.py guard only gated the second-layer `patch_vfbquery_with_caching()` patch, but the @with_solr_cache decorator is applied at module-import time to functions in vfb_queries.py (term_info, instances, templates, neurons_part_here, etc.) and was running its full cache write/read path even when VFBQUERY_CACHE_ENABLED=false. With caching disabled on a broken SOLR backend, every call still made the failing write attempt.
   Fix: respect the env flag inside the wrapper. If caching is disabled, pop force_refresh (so the wrapped function doesn't see a stray kwarg it can't accept) and call straight through.

2) cache_result() backed off only on `Exception`.
   HTTP 5xx responses (Lucene IndexWriter closed, SOLR proxy 502/503) hit the `else:` branch, which just logged and returned False. Every subsequent call re-attempted the same write, hitting the same 5xx, costing 200-300ms each time and producing a multi-KB stack trace per call.
   Fix: treat any 5xx as cause to set _solr_disabled and start the same backoff window the exception path uses. 4xx still logs once but doesn't disable (likely a payload/config issue, not server-down).

Net effect on the python-test workflow:
- VFBQUERY_CACHE_ENABLED=false → no cache writes attempted at all
- VFBQUERY_CACHE_ENABLED=true on a broken backend → one warning, then fast-fail for the backoff window

Server-side: the broken Lucene segment on the SOLR vfb_json core is a separate sysadmin issue (filesystem I/O error on the SOLR host). This PR doesn't fix it, but it stops the failure from cascading into every test run.

Refs: PR #41 CI logs showing the storm.
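
For readers who want the two behaviours in isolation, here is a minimal standalone sketch; it is not the code in solr_result_cache.py. `BackoffState`, `cache_enabled`, `gated`, and the `cache_lookup` callback are hypothetical names invented for illustration, and the real module tracks its state in `_solr_disabled`, `_solr_disabled_until`, and `_solr_last_error` instead. The sketch assumes a 60-second default backoff, which the commit does not specify, and it collects warnings in a list where the real code logs them.

```python
import os
import time
from functools import wraps

# Values of VFBQUERY_CACHE_ENABLED that switch caching off (same set as the diff).
_DISABLED_VALUES = ("false", "0", "no", "off")


def cache_enabled() -> bool:
    """Read the env flag at call time; caching defaults to on."""
    return os.getenv("VFBQUERY_CACHE_ENABLED", "true").lower() not in _DISABLED_VALUES


class BackoffState:
    """Hypothetical stand-in for the cache object's Solr backoff fields."""

    def __init__(self, backoff_seconds: float = 60.0):  # 60s is an assumed default
        self.backoff_seconds = backoff_seconds
        self.disabled_until = 0.0
        self.last_error = None
        self.warnings = []  # collected rather than logged, to keep the sketch testable

    def available(self, now=None) -> bool:
        """True once the backoff window (if any) has elapsed."""
        now = time.time() if now is None else now
        return now >= self.disabled_until

    def record_failure(self, status_code: int, body: str, now=None) -> None:
        """5xx opens a backoff window; other statuses only report once per distinct error."""
        now = time.time() if now is None else now
        err = f"HTTP {status_code} - {body[:500]}"
        if status_code >= 500:
            self.disabled_until = now + self.backoff_seconds
        if err != self.last_error:
            self.warnings.append(err)
            self.last_error = err


def gated(cache_lookup):
    """Decorator factory: bypass cache_lookup entirely when the env flag disables caching."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            # Always strip force_refresh so the wrapped function never sees it.
            force_refresh = kwargs.pop("force_refresh", False)
            if not cache_enabled():
                return func(*args, **kwargs)
            return cache_lookup(func, force_refresh, *args, **kwargs)
        return wrapper
    return decorator
```

Passing `now` explicitly makes the backoff window deterministic in tests. Reading the env flag inside the wrapper, rather than at decoration time, matters because the decorator is applied at import time and the flag may be set afterwards.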
diff --git a/src/vfbquery/solr_result_cache.py b/src/vfbquery/solr_result_cache.py
@@ -391,9 +391,33 @@ def cache_result(self, query_type: str, term_id: str, result: Any, **params) ->
                 logger.info(f"Cached {query_type} for {term_id} as {cache_doc_id}, size: {cached_data['result_size']/1024:.1f}KB")
                 return True
             else:
-                logger.error(f"Failed to cache result: HTTP {response.status_code} - {response.text}")
+                # Server-side failure (typical examples: 500 Lucene IndexWriter
+                # closed after a disk I/O error; 502/503 SOLR proxy down). The
+                # original code only logged here and kept retrying every call,
+                # which floods the log and adds ~200-300ms per query for the
+                # round-trip to fail. Treat any 5xx as cause to trip the same
+                # backoff machinery the exception branch uses, so subsequent
+                # cache_result() calls fast-fail via _solr_available().
+                err = f"HTTP {response.status_code} - {response.text[:500]}"
+                if response.status_code >= 500:
+                    self._solr_disabled = True
+                    self._solr_disabled_until = time.time() + self._solr_backoff_seconds
+                    if err != self._solr_last_error:
+                        logger.warning(
+                            "Solr cache write returned %d; disabling cache for %ds: %s",
+                            response.status_code,
+                            self._solr_backoff_seconds,
+                            err,
+                        )
+                        self._solr_last_error = err
+                else:
+                    # 4xx — probably a payload / config issue. Don't disable
+                    # but log once.
+                    if err != self._solr_last_error:
+                        logger.error("Failed to cache result: %s", err)
+                        self._solr_last_error = err
                 return False
-                
+
         except Exception as e:
             # Mark Solr as temporarily unavailable to avoid repeated errors
             self._solr_disabled = True
@@ -766,9 +790,21 @@ def get_term_info(short_form, force_refresh=False, **kwargs):
     """
     def decorator(func):
         def wrapper(*args, **kwargs):
+            # Honour VFBQUERY_CACHE_ENABLED=false by bypassing the cache layer
+            # entirely. The __init__.py only gates the second-layer
+            # patch_vfbquery_with_caching() patch, but this @with_solr_cache
+            # decorator is applied at module-import time and so was firing
+            # unconditionally before this guard — flooding the log with HTTP
+            # 500s and adding hundreds of ms per call when the cache backend
+            # is down. Pop force_refresh either way so the wrapped function
+            # doesn't see a stray kwarg it doesn't accept.
+            if os.getenv('VFBQUERY_CACHE_ENABLED', 'true').lower() in ('false', '0', 'no', 'off'):
+                kwargs.pop('force_refresh', None)
+                return func(*args, **kwargs)
+
             # Check if force_refresh is requested (pop it before passing to function)
             force_refresh = kwargs.pop('force_refresh', False)
-            
+
             # Check if limit is applied - only cache full results (limit=-1)
             limit = kwargs.get('limit', -1)
             should_cache = (limit == -1)  # Only cache when getting all results (limit=-1)