OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (2)
📝 WalkthroughWalkthroughThe PR adds Postgres cache health tracking and timeout settings, updates connection and cache error handling, and changes ChangesPostgres health and liveness
Estimated code review effort: 4 (Complex) | ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (1)
tests/unit/cache/test_postgres_cache.py (1)
836-865: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftExercise the health-check behavior instead of assigning its result.
These tests directly set
_health_statusand_consecutive_failures, so they would still pass if_health_check_loop()stopped updating either field. Extract a single-iteration health-check helper or inject a controllable sleep/event so the tests drive the success and failure paths. As per coding guidelines,tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/cache/test_postgres_cache.py` around lines 836 - 865, The health-check tests are bypassing the real behavior by mutating PostgresCache internals directly, so they won’t catch regressions in _health_check_loop(). Update the tests to drive the loop logic through a single-iteration helper or a controllable sleep/event on PostgresCache, then assert the resulting _health_status and _consecutive_failures values after simulated success/failure. Use the existing PostgresCache, _health_check_loop, _mark_unhealthy, and ready() symbols to locate the behavior and keep the assertions focused on observable state changes rather than direct field assignment.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ols/app/endpoints/health.py`:
- Around line 126-129: The 503 liveness response in the health endpoint is
returning an HTTPException shape instead of the declared LivenessResponse
contract. Update the liveness failure path in the health endpoint so the 503
body matches LivenessResponse (for example, by returning that model directly),
or change the documented 503 response model to the actual error payload; keep
the response schema and the OpenAPI declaration in sync around the liveness
handler.
In `@ols/app/models/config.py`:
- Around line 801-803: The config model currently accepts negative values for
statement_timeout, lock_timeout, and health_check_interval, so tighten
validation in the Pydantic config class that defines these fields. Add explicit
field constraints or validators on statement_timeout and lock_timeout to reject
negative integers, and make health_check_interval strictly positive so the
health-check loop and PostgresCache lock acquisition keep their timing
guarantees. Use the existing config model/field definitions in Config to enforce
these contracts at load time rather than downstream.
- Around line 1166-1167: `OLSConfig.__init__` is currently taking
`liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain
a string or accept invalid values. Parse this field as an int when loading
config, and validate that the value is at least 1; if it is missing,
non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before
the value is used by `/liveness`.
In `@ols/src/cache/postgres_cache.py`:
- Around line 159-165: Validate the Postgres cache timing settings before they
reach the daemon: `health_check_interval` must not allow negative values or
zero-hot-loop behavior, and `lock_timeout` should reject negative values unless
indefinite blocking is explicitly intended. Add Pydantic validation in
`PostgresConfig` (using appropriate constrained types such as
non-negative/positive ints) and keep the runtime consumers in `PostgresCache`
safe by relying on those validated values in the initialization and health-check
logic. Update any related uses in `PostgresCache` and its config wiring so the
daemon never calls `time.sleep()` or lock acquisition with invalid values.
- Around line 189-219: The health-check path in _connect_health and
_health_check_loop can block indefinitely because neither the psycopg2.connect
call nor the SELECT 1 probe has a timeout. Update the PostgresCache health
connection setup to pass an explicit connect timeout in connect_kwargs and add a
statement timeout before executing the cursor probe, so ready() does not wait on
a hung network or server and _health_status can recover promptly.
- Around line 194-195: The broad exception handling in PostgresCache needs to be
narrowed to satisfy Ruff BLE001. Update the catch blocks in _connect_health,
_health_check_loop, _safe_rollback, and _safe_set_autocommit to handle only the
specific psycopg2 or attribute-related exceptions they can actually raise, and
keep the existing logger/debug behavior in those paths. If any catch-all is
intentionally required to keep the daemon alive, add an explicit # noqa: BLE001
on that except clause rather than leaving a bare except Exception.
In `@ols/utils/postgres.py`:
- Around line 106-108: The new cursor in the statement-timeout setup can leak if
`cursor.execute` fails during schema initialization. Update the `Postgres`
initialization flow around the `cursor = self.connection.cursor()` and `SET
statement_timeout` call to use a `try/finally` so `cursor.close()` always runs,
even on exceptions; keep the fix localized to this cursor-handling block.
- Around line 29-45: The wrapper in postgres connection handling is calling
connectable.connect() before the try/except, so a failed initial connect skips
the same recovery path as query failures. Move the initial connect() call into
the connection-error handling flow inside wrapper() so both the first connect
and the retry go through the same psycopg2.OperationalError/InterfaceError
handling, including _mark_unhealthy() and CacheError wrapping in the reconnect
path.
- Around line 33-45: The `@connection` retry logic in the postgres decorator is
causing duplicate writes for methods like PostgresCache.insert_or_append because
the operation may have already committed before reconnect and retry. Update the
retry behavior in the connection wrapper to avoid automatically retrying
write/mutating methods after OperationalError/InterfaceError, and keep the retry
path only for safe read operations or explicitly idempotent calls identified by
the wrapped function name.
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 20-23: The new pytest fixture and several test functions in
Postgres cache tests are missing type annotations. Update _suppress_health_loop
to declare Generator[None, None, None] as its return type, and add -> None to
the affected test functions in test_postgres_cache so the new autouse fixture
and tests are fully typed and consistent with the rest of the module.
In `@tests/unit/utils/test_postgres.py`:
- Around line 108-147: Add explicit -> None return annotations to the new test
methods in test_postgres.py, including the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
- Around line 124-129: The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.
---
Nitpick comments:
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 836-865: The health-check tests are bypassing the real behavior by
mutating PostgresCache internals directly, so they won’t catch regressions in
_health_check_loop(). Update the tests to drive the loop logic through a
single-iteration helper or a controllable sleep/event on PostgresCache, then
assert the resulting _health_status and _consecutive_failures values after
simulated success/failure. Use the existing PostgresCache, _health_check_loop,
_mark_unhealthy, and ready() symbols to locate the behavior and keep the
assertions focused on observable state changes rather than direct field
assignment.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3804766b-967a-49bb-b307-520f135d2133
📒 Files selected for processing (10)
ols/app/endpoints/health.pyols/app/models/config.pyols/app/models/models.pyols/constants.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/app/models/test_config.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
| self._lock_timeout = config.lock_timeout | ||
| self.capacity = config.max_entries | ||
|
|
||
| self._health_status = True | ||
| self._consecutive_failures = 0 | ||
| self._health_lock = threading.Lock() | ||
| self._health_check_interval = config.health_check_interval |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Validate timeout and interval values before the daemon uses them.
health_check_interval reaches time.sleep() before the try, so a negative value kills the health thread and 0 creates a hot polling loop. lock_timeout should also reject negative values unless indefinite blocking is intentional. The related PostgresConfig contract currently exposes these as plain int; validate them there with PositiveInt/non-negative types. As per coding guidelines, ols/**/*.py: Use Pydantic models for configuration and data validation.
Also applies to: 211-215, 254-257
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 159 - 165, Validate the
Postgres cache timing settings before they reach the daemon:
`health_check_interval` must not allow negative values or zero-hot-loop
behavior, and `lock_timeout` should reject negative values unless indefinite
blocking is explicitly intended. Add Pydantic validation in `PostgresConfig`
(using appropriate constrained types such as non-negative/positive ints) and
keep the runtime consumers in `PostgresCache` safe by relying on those validated
values in the initialization and health-check logic. Update any related uses in
`PostgresCache` and its config wiring so the daemon never calls `time.sleep()`
or lock acquisition with invalid values.
Source: Coding guidelines
| except (psycopg2.OperationalError, psycopg2.InterfaceError) as e: | ||
| logger.warning( | ||
| "Connection error in %s, attempting reconnect: %s", f.__name__, e | ||
| ) | ||
| if hasattr(connectable, "_mark_unhealthy"): | ||
| connectable._mark_unhealthy() | ||
| try: | ||
| connectable.connect() | ||
| except Exception as reconnect_err: | ||
| raise CacheError( | ||
| f"reconnect failed in {f.__name__}", reconnect_err | ||
| ) from reconnect_err | ||
| return f(connectable, *args, **kwargs) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the relevant file and nearby symbols.
git ls-files | rg '^ols/utils/postgres\.py$|^ols/.+\.py$' >/dev/null
ast-grep outline ols/utils/postgres.py --view expanded || true
echo '--- file excerpt ---'
wc -l ols/utils/postgres.py
sed -n '1,220p' ols/utils/postgres.py
echo '--- usages ---'
rg -n "insert_or_append|_mark_unhealthy|connect\(|OperationalError|InterfaceError|CacheError" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 8349
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the cache implementation to understand commit/append semantics.
fd -a 'postgres.py' ols
fd -a 'cache.py' ols
fd -a '*cache*' ols | sed -n '1,120p'
echo '--- Postgres cache excerpt ---'
sed -n '1,260p' ols/utils/postgres.py
echo '--- cache-related excerpts ---'
rg -n "class .*Cache|def insert_or_append|append|commit|upsert|ON CONFLICT|executemany|execute" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 639
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded || true
echo '--- insert_or_append section ---'
sed -n '280,380p' ols/src/cache/postgres_cache.py
echo '--- decorators/usages in postgres_cache ---'
rg -n "^\\s*`@connection`|def (get|insert_or_append|delete|list|set_topic_summary)|commit\\(|ON CONFLICT|DO UPDATE|DO NOTHING|append" ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 6822
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- cache interface excerpt ---'
sed -n '1,220p' ols/src/cache/cache.py
echo '--- in-memory implementation excerpt ---'
sed -n '1,180p' ols/src/cache/in_memory_cache.pyRepository: openshift/lightspeed-service
Length of output: 10989
🏁 Script executed:
#!/bin/bash
set -euo pipefail
python3 - <<'PY'
from pathlib import Path
path = Path('ols/src/cache/postgres_cache.py')
text = path.read_text()
for needle in ['`@connection`', 'def insert_or_append', 'def get', 'def delete', 'def list', 'def set_topic_summary']:
idx = text.find(needle)
if idx != -1:
start = text.rfind('\n', 0, idx-120)
end = text.find('\n\n', idx)
print(f'### {needle}')
print(text[max(0, idx-120):min(len(text), idx+1200)])
print()
PYRepository: openshift/lightspeed-service
Length of output: 8205
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -A80 -B20 "^\\s*`@connection`|^\\s*def insert_or_append|^\\s*def get|^\\s*def delete|^\\s*def list|^\\s*def set_topic_summary" ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 11834
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '280,370p' ols/src/cache/postgres_cache.pyRepository: openshift/lightspeed-service
Length of output: 3988
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '280,370p'Repository: openshift/lightspeed-service
Length of output: 4625
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 724
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '1,260p'Repository: openshift/lightspeed-service
Length of output: 11615
🏁 Script executed:
#!/bin/bash
set -euo pipefail
cat -n ols/src/cache/postgres_cache.py | sed -n '285,360p'Repository: openshift/lightspeed-service
Length of output: 3885
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'Repository: openshift/lightspeed-service
Length of output: 245
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo '--- PostgresBase subclasses ---'
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'
echo '--- connection decorator uses ---'
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 869
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 724
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "`@connection`" ols -g '*.py' || trueRepository: openshift/lightspeed-service
Length of output: 724
Limit @connection retries on write methods. PostgresCache.insert_or_append is retried after OperationalError/InterfaceError, but it appends to the stored history and increments message_count before commit(). If the connection drops after PostgreSQL has already committed, the retry can duplicate the entry and metadata update.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/utils/postgres.py` around lines 33 - 45, The `@connection` retry logic in
the postgres decorator is causing duplicate writes for methods like
PostgresCache.insert_or_append because the operation may have already committed
before reconnect and retry. Update the retry behavior in the connection wrapper
to avoid automatically retrying write/mutating methods after
OperationalError/InterfaceError, and keep the retry path only for safe read
operations or explicitly idempotent calls identified by the wrapped function
name.
eea3052 to
15f8455
Compare
Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
15f8455 to
6e37719
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 350-353: The DatabaseError handling in PostgresCache should use
the same rollback protection as the operational-error path, because calling
connection.rollback() directly can hide the original psycopg2.DatabaseError and
prevent the CacheError wrapping from happening. Update the exception blocks in
PostgresCache.insert_or_append and the other matching DatabaseError path to call
_safe_rollback() instead of self.connection.rollback(), while keeping the
existing logger.error and CacheError("PostgresCache.insert_or_append", e)
behavior intact.
- Around line 184-188: The `_mark_unhealthy()` path only updates
`_health_status`, so repeated request-path failures from the `@connection` retry
flow do not advance `_consecutive_failures` toward the liveness threshold.
Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure
counter under `_health_lock`, keeping the unhealthy flag and failure tracking in
sync so `is_healthy`/liveness checks can trip promptly after repeated
cache-operation DB failures.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c84be3c2-a66c-475a-91d3-5d332a00fb83
📒 Files selected for processing (10)
docs/openapi.jsonols/app/endpoints/health.pyols/app/models/config.pyols/constants.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/app/models/test_config.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (8)
- ols/app/endpoints/health.py
- ols/constants.py
- tests/unit/cache/test_postgres_cache.py
- tests/unit/app/endpoints/test_health.py
- ols/app/models/config.py
- tests/unit/utils/test_postgres.py
- ols/utils/postgres.py
- tests/unit/app/models/test_config.py
There was a problem hiding this comment.
🧹 Nitpick comments (2)
ols/app/endpoints/health.py (1)
134-144: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winEncapsulate health-state access instead of reaching into PostgresCache internals.
The endpoint directly touches
cache._health_lockandcache._consecutive_failures, private attributes ofPostgresCache. This crosses the app/cache layer boundary and couples the endpoint to the cache's internal locking scheme.Consider exposing a public method (e.g.
PostgresCache.is_healthy(threshold) -> bool) that encapsulates the lock + comparison, sohealth.pyonly calls a stable public API.As per coding guidelines, "Respect architectural boundaries: Do not cross module or layer boundaries, even when it is the shorter path."
♻️ Proposed refactor
- cache = config._conversation_cache - if isinstance(cache, PostgresCache): - threshold = config.ols_config.liveness_db_failure_threshold - with cache._health_lock: - failures = cache._consecutive_failures - if failures >= threshold: - response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE - return LivenessResponse(alive=False, reason="database unreachable") + cache = config._conversation_cache + if isinstance(cache, PostgresCache): + threshold = config.ols_config.liveness_db_failure_threshold + if not cache.is_healthy(threshold): + response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE + return LivenessResponse(alive=False, reason="database unreachable")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ols/app/endpoints/health.py` around lines 134 - 144, The liveness probe in liveness_probe_get_method is reaching into PostgresCache private state (_health_lock and _consecutive_failures), so move that lock-and-compare logic behind a public API on PostgresCache such as is_healthy(threshold) -> bool. Update health.py to call only the new public method and keep the endpoint from depending on cache internals.Source: Coding guidelines
tests/unit/app/endpoints/test_health.py (1)
230-283: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winHoist repeated inline
PostgresConfigimport to module level.
from ols.app.models.config import PostgresConfigis repeated inline in three test functions (lines 233, 251, 271). Unless this is needed to avoid a circular import, move it to the top of the file.As per coding guidelines, "module-level imports by default, inline imports only when needed for circular dependencies or deferred optional heavy dependencies."
♻️ Proposed refactor
+from ols.app.models.config import PostgresConfig from ols.src.cache.postgres_cache import PostgresCacheAnd remove the three inline
from ols.app.models.config import PostgresConfigoccurrences inside the test bodies.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/app/endpoints/test_health.py` around lines 230 - 283, The three liveness probe tests repeat an inline PostgresConfig import inside each test body; hoist that import to the module level in the test file and remove the repeated local imports from test_liveness_probe_returns_alive_when_postgres_healthy, test_liveness_probe_returns_503_when_postgres_unhealthy, and test_liveness_probe_returns_alive_when_below_threshold. Keep the tests using PostgresConfig exactly as before, just reference the shared module-level import to follow the import guideline unless a circular dependency requires otherwise.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@ols/app/endpoints/health.py`:
- Around line 134-144: The liveness probe in liveness_probe_get_method is
reaching into PostgresCache private state (_health_lock and
_consecutive_failures), so move that lock-and-compare logic behind a public API
on PostgresCache such as is_healthy(threshold) -> bool. Update health.py to call
only the new public method and keep the endpoint from depending on cache
internals.
In `@tests/unit/app/endpoints/test_health.py`:
- Around line 230-283: The three liveness probe tests repeat an inline
PostgresConfig import inside each test body; hoist that import to the module
level in the test file and remove the repeated local imports from
test_liveness_probe_returns_alive_when_postgres_healthy,
test_liveness_probe_returns_503_when_postgres_unhealthy, and
test_liveness_probe_returns_alive_when_below_threshold. Keep the tests using
PostgresConfig exactly as before, just reference the shared module-level import
to follow the import guideline unless a circular dependency requires otherwise.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 51571fc0-22a7-4977-978e-38af5cbf15fd
📒 Files selected for processing (8)
ols/app/endpoints/health.pyols/app/models/config.pyols/app/models/models.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (5)
- ols/app/models/config.py
- ols/src/cache/postgres_cache.py
- tests/unit/cache/test_postgres_cache.py
- ols/utils/postgres.py
- tests/unit/utils/test_postgres.py
- Align 503 liveness response with LivenessResponse schema by adding optional reason field to LivenessResponse model and using Response parameter to set status code - Add NonNegativeInt/PositiveInt validation for PostgresConfig timeout fields (statement_timeout, lock_timeout, health_check_interval) - Validate liveness_db_failure_threshold in OLSConfig with helper function to ensure positive integer value - Add connect_timeout and statement_timeout to health check connection - Increment _consecutive_failures in _mark_unhealthy for proper liveness threshold tracking - Use _safe_rollback in DatabaseError exception paths - Close cursor with try/finally in postgres.py for proper cleanup - Add type annotations to test fixtures and functions - Tighten no-retry assertion in test_postgres.py to verify reconnect is not called on DatabaseError Co-authored-by: Cursor <cursoragent@cursor.com>
7e46a50 to
c81358c
Compare
Adversarial Code Review - PostgreSQL Auto-Recovery (OLS-3221)Critical Issues1. Race Condition: Failure Counter Reset vs Increment (
|
Adversarial Review - Updated Code (Follow-up)Issues Addressed ✅Great progress! The following issues from my previous review have been addressed:
Remaining Issues1. Health Connection State Inconsistency (
|
- Fix encapsulation: add public consecutive_failures property, remove direct access to private _health_lock/_consecutive_failures from liveness endpoint - Fix initial connection bypass: move connect() inside try/except in @connection decorator so failures get _mark_unhealthy() and CacheError - Make health check connect timeout configurable via PostgresConfig - Add graceful shutdown: shutdown() joins thread and closes connection - Fix delayed initial check: run health probe before sleep, use Event.wait() instead of time.sleep() - Fix startup race: skip failure counter on first health check to avoid premature liveness failures - Fix stale connection reference: set _health_connection = None after close before reconnect - Reject zero lock_timeout: change from NonNegativeInt to PositiveInt Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 219-230: The Postgres cache shutdown path is implemented in
PostgresCache.shutdown() but is never invoked by the app lifecycle. Update the
app owner in main application startup/shutdown flow to call
config.conversation_cache.shutdown() during process teardown so the health
thread is joined and the connection is closed. Use the existing
CacheFactory-created cache instance and wire the call into the app
shutdown/lifespan handling in ols/app/main.py.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b907267e-0034-4eef-b88d-2b21f0a353dc
📒 Files selected for processing (9)
ols/app/endpoints/health.pyols/app/models/config.pyols/app/models/models.pyols/constants.pyols/src/cache/postgres_cache.pyols/utils/postgres.pytests/unit/app/endpoints/test_health.pytests/unit/cache/test_postgres_cache.pytests/unit/utils/test_postgres.py
✅ Files skipped from review due to trivial changes (1)
- ols/constants.py
🚧 Files skipped from review as they are similar to previous changes (7)
- ols/app/models/models.py
- tests/unit/app/endpoints/test_health.py
- ols/app/endpoints/health.py
- ols/utils/postgres.py
- tests/unit/cache/test_postgres_cache.py
- ols/app/models/config.py
- tests/unit/utils/test_postgres.py
| def shutdown(self) -> None: | ||
| """Stop the health-check thread and close its connection.""" | ||
| self._shutdown_event.set() | ||
| self._health_thread.join(timeout=self._health_check_connect_timeout) | ||
| if self._health_connection is not None: | ||
| try: | ||
| self._health_connection.close() | ||
| except Exception as close_err: | ||
| logger.debug( | ||
| "Failed to close health connection on shutdown: %s", close_err | ||
| ) | ||
| self._health_connection = None |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find any callers of PostgresCache.shutdown() outside its definition/tests.
rg -nP '\.shutdown\s*\(' --type=py -g '!**/test*/**' -C3
# Also check lifespan/atexit hooks that might wire cache cleanup.
rg -nP '(lifespan|atexit|on_event\(.?shutdown|register\()' --type=py -C2Repository: openshift/lightspeed-service
Length of output: 166
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Locate the cache class and its usages.
git ls-files | rg '^ols/src/cache/postgres_cache\.py$|^ols/src/.*\.py$|^.*(main|app|lifespan|startup|shutdown|atexit).*\.(py|ts|js)$' || true
printf '\n## PostgresCache definition/usages\n'
rg -n "class PostgresCache|PostgresCache\(" ols/src -g '*.py' -C 2 || true
printf '\n## shutdown() callers\n'
rg -n "\.shutdown\s*\(" ols/src -g '*.py' -C 2 || true
printf '\n## lifecycle hooks / cleanup wiring\n'
rg -n "(lifespan|atexit|on_event\(.?shutdown|register\(|shutdown_event|startup_event|app\.on_event|FastAPI\()" ols/src -g '*.py' -C 2 || trueRepository: openshift/lightspeed-service
Length of output: 6600
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect the cache module and nearby app/bootstrap files for cleanup wiring.
ast-grep outline ols/src/cache/postgres_cache.py || true
printf '\n## postgres_cache.py excerpt\n'
sed -n '1,280p' ols/src/cache/postgres_cache.py | cat -n
printf '\n## repo files mentioning cache module\n'
rg -n "postgres_cache|PostgresCache|cache" ols/src -g '*.py' -C 2 || trueRepository: openshift/lightspeed-service
Length of output: 50386
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Find application bootstrap/lifespan files in the repo.
fd -a -t f -e py 'app|main|lifespan|startup|shutdown|fastapi|asgi' ols || trueRepository: openshift/lightspeed-service
Length of output: 414
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '## app/main.py outline\n'
ast-grep outline ols/app/main.py || true
printf '\n## app/main.py relevant lines\n'
rg -n "FastAPI|lifespan|startup|shutdown|conversation_cache|CacheFactory|Cache|router|include_router|on_event|atexit" ols/app/main.py -C 3 || true
printf '\n## config/cache ownership\n'
rg -n "conversation_cache|CacheFactory|cache_factory|PostgresCache|InMemoryCache|cache:" ols/app/models/config.py ols/src/cache/cache_factory.py ols/app -g '*.py' -C 3 || trueRepository: openshift/lightspeed-service
Length of output: 15578
🏁 Script executed:
#!/bin/bash
set -euo pipefail
printf '## app/models/config.py outline\n'
ast-grep outline ols/app/models/config.py || true
printf '\n## cache-related definitions in config.py\n'
rg -n "conversation_cache|CacheFactory|cache|shutdown|lifecycle|lifespan" ols/app/models/config.py -C 4 || true
printf '\n## cache factory file\n'
ast-grep outline ols/src/cache/cache_factory.py || true
sed -n '1,220p' ols/src/cache/cache_factory.py | cat -nRepository: openshift/lightspeed-service
Length of output: 13311
Wire PostgresCache.shutdown() into app shutdown
ols/app/main.py has no lifespan or shutdown hook, and CacheFactory only constructs the cache. Call config.conversation_cache.shutdown() from the app owner so the health thread and connection are closed before process exit.
🧰 Tools
🪛 Ruff (0.15.20)
[warning] 226-226: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@ols/src/cache/postgres_cache.py` around lines 219 - 230, The Postgres cache
shutdown path is implemented in PostgresCache.shutdown() but is never invoked by
the app lifecycle. Update the app owner in main application startup/shutdown
flow to call config.conversation_cache.shutdown() during process teardown so the
health thread is joined and the connection is closed. Use the existing
CacheFactory-created cache instance and wire the call into the app
shutdown/lifespan handling in ols/app/main.py.
- Update test_liveness assertion to include reason: null field added to LivenessResponse model - Regenerate docs/openapi.json to reflect config and model changes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
@sriroopar: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted.
Description
Type of change
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
/livenesscan now return 503 when the database-backed cache is failing, including a “database unreachable” reason.Bug Fixes
Documentation
/liveness, including the new optionalreasonfield.Tests
reason.