Skip to content

OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964

Open
sriroopar wants to merge 4 commits into
openshift:mainfrom
sriroopar:postgres-auto-recovery
Open

OLS-3221 Add PostgreSQL auto-recovery after DB restart#2964
sriroopar wants to merge 4 commits into
openshift:mainfrom
sriroopar:postgres-auto-recovery

Conversation

@sriroopar

@sriroopar sriroopar commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Implement background health-check loop, smarter error classification in the @connection decorator, operation timeouts, and enhanced liveness/readiness probes so OLS can recover automatically when the backing PostgreSQL database is restarted.

Description

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Enhanced service health checks: GET /liveness can now return 503 when the database-backed cache is failing, including a “database unreachable” reason.
    • Added configurable health threshold and Postgres/cache health-tuning settings.
  • Bug Fixes

    • More reliable liveness reporting by using the cache’s background health state, with improved reconnect/recovery behavior.
  • Documentation

    • Updated OpenAPI for 503 /liveness, including the new optional reason field.
  • Tests

    • Updated and expanded liveness, configuration, and cache-health tests to cover threshold behavior and reason.

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15deedcd-abc4-4923-a915-c3b704d6e46a

📥 Commits

Reviewing files that changed from the base of the PR and between f7d7700 and cdb26ca.

📒 Files selected for processing (2)
  • docs/openapi.json
  • tests/integration/test_liveness_readiness.py

📝 Walkthrough

Walkthrough

The PR adds Postgres cache health tracking and timeout settings, updates connection and cache error handling, and changes /liveness to return 503 when the Postgres failure threshold is reached.

Changes

Postgres health and liveness

Layer / File(s) Summary
Config and response models
ols/constants.py, ols/app/models/config.py, ols/app/models/models.py, tests/unit/app/models/test_config.py
New Postgres timeout settings, cache health tuning, and liveness threshold configuration are added, and LivenessResponse gains an optional reason field.
Connection retry and timeout setup
ols/utils/postgres.py, tests/unit/utils/test_postgres.py
The Postgres connection decorator retries selected connection errors, wraps database errors in CacheError, and connect() sets statement_timeout.
Postgres cache health loop and locking
ols/src/cache/postgres_cache.py, tests/unit/cache/test_postgres_cache.py
PostgresCache adds background health tracking, bounded lock acquisition, cached readiness, and updated cache error handling with tests.
Liveness endpoint and API contract
ols/app/endpoints/health.py, docs/openapi.json, tests/unit/app/endpoints/test_health.py, tests/integration/test_liveness_readiness.py
/liveness now returns 503 when Postgres failure streaks reach the configured threshold, and the API docs and tests reflect the new response shape.

Estimated code review effort: 4 (Complex) | ~60 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly matches the main change: PostgreSQL auto-recovery after a database restart.
Docstring Coverage ✅ Passed Docstring coverage is 96.72% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@openshift-ci openshift-ci Bot requested review from bparees and tisnik June 24, 2026 17:46
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign blublinsky for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 12

🧹 Nitpick comments (1)
tests/unit/cache/test_postgres_cache.py (1)

836-865: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Exercise the health-check behavior instead of assigning its result.

These tests directly set _health_status and _consecutive_failures, so they would still pass if _health_check_loop() stopped updating either field. Extract a single-iteration health-check helper or inject a controllable sleep/event so the tests drive the success and failure paths. As per coding guidelines, tests/**/*.py: Assert specific values and behaviors in tests, not just that code runs without error.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/cache/test_postgres_cache.py` around lines 836 - 865, The
health-check tests are bypassing the real behavior by mutating PostgresCache
internals directly, so they won’t catch regressions in _health_check_loop().
Update the tests to drive the loop logic through a single-iteration helper or a
controllable sleep/event on PostgresCache, then assert the resulting
_health_status and _consecutive_failures values after simulated success/failure.
Use the existing PostgresCache, _health_check_loop, _mark_unhealthy, and ready()
symbols to locate the behavior and keep the assertions focused on observable
state changes rather than direct field assignment.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/app/endpoints/health.py`:
- Around line 126-129: The 503 liveness response in the health endpoint is
returning an HTTPException shape instead of the declared LivenessResponse
contract. Update the liveness failure path in the health endpoint so the 503
body matches LivenessResponse (for example, by returning that model directly),
or change the documented 503 response model to the actual error payload; keep
the response schema and the OpenAPI declaration in sync around the liveness
handler.

In `@ols/app/models/config.py`:
- Around line 801-803: The config model currently accepts negative values for
statement_timeout, lock_timeout, and health_check_interval, so tighten
validation in the Pydantic config class that defines these fields. Add explicit
field constraints or validators on statement_timeout and lock_timeout to reject
negative integers, and make health_check_interval strictly positive so the
health-check loop and PostgresCache lock acquisition keep their timing
guarantees. Use the existing config model/field definitions in Config to enforce
these contracts at load time rather than downstream.
- Around line 1166-1167: `OLSConfig.__init__` is currently taking
`liveness_db_failure_threshold` directly from `data.get(...)`, so it can remain
a string or accept invalid values. Parse this field as an int when loading
config, and validate that the value is at least 1; if it is missing,
non-numeric, zero, or negative, raise `checks.InvalidConfigurationError` before
the value is used by `/liveness`.

In `@ols/src/cache/postgres_cache.py`:
- Around line 159-165: Validate the Postgres cache timing settings before they
reach the daemon: `health_check_interval` must not allow negative values or
zero-hot-loop behavior, and `lock_timeout` should reject negative values unless
indefinite blocking is explicitly intended. Add Pydantic validation in
`PostgresConfig` (using appropriate constrained types such as
non-negative/positive ints) and keep the runtime consumers in `PostgresCache`
safe by relying on those validated values in the initialization and health-check
logic. Update any related uses in `PostgresCache` and its config wiring so the
daemon never calls `time.sleep()` or lock acquisition with invalid values.
- Around line 189-219: The health-check path in _connect_health and
_health_check_loop can block indefinitely because neither the psycopg2.connect
call nor the SELECT 1 probe has a timeout. Update the PostgresCache health
connection setup to pass an explicit connect timeout in connect_kwargs and add a
statement timeout before executing the cursor probe, so ready() does not wait on
a hung network or server and _health_status can recover promptly.
- Around line 194-195: The broad exception handling in PostgresCache needs to be
narrowed to satisfy Ruff BLE001. Update the catch blocks in _connect_health,
_health_check_loop, _safe_rollback, and _safe_set_autocommit to handle only the
specific psycopg2 or attribute-related exceptions they can actually raise, and
keep the existing logger/debug behavior in those paths. If any catch-all is
intentionally required to keep the daemon alive, add an explicit # noqa: BLE001
on that except clause rather than leaving a bare except Exception.

In `@ols/utils/postgres.py`:
- Around line 106-108: The new cursor in the statement-timeout setup can leak if
`cursor.execute` fails during schema initialization. Update the `Postgres`
initialization flow around the `cursor = self.connection.cursor()` and `SET
statement_timeout` call to use a `try/finally` so `cursor.close()` always runs,
even on exceptions; keep the fix localized to this cursor-handling block.
- Around line 29-45: The wrapper in postgres connection handling is calling
connectable.connect() before the try/except, so a failed initial connect skips
the same recovery path as query failures. Move the initial connect() call into
the connection-error handling flow inside wrapper() so both the first connect
and the retry go through the same psycopg2.OperationalError/InterfaceError
handling, including _mark_unhealthy() and CacheError wrapping in the reconnect
path.
- Around line 33-45: The `@connection` retry logic in the postgres decorator is
causing duplicate writes for methods like PostgresCache.insert_or_append because
the operation may have already committed before reconnect and retry. Update the
retry behavior in the connection wrapper to avoid automatically retrying
write/mutating methods after OperationalError/InterfaceError, and keep the retry
path only for safe read operations or explicitly idempotent calls identified by
the wrapped function name.

In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 20-23: The new pytest fixture and several test functions in
Postgres cache tests are missing type annotations. Update _suppress_health_loop
to declare Generator[None, None, None] as its return type, and add -> None to
the affected test functions in test_postgres_cache so the new autouse fixture
and tests are fully typed and consistent with the rest of the module.

In `@tests/unit/utils/test_postgres.py`:
- Around line 108-147: Add explicit -> None return annotations to the new test
methods in test_postgres.py, including the helper failing_connect nested in
test_connection_error_reconnect_failure_raises_cache_error, to keep the test
signatures consistent. Also tighten
test_database_error_wraps_in_cache_error_no_retry by asserting the Connectable
instance does not attempt reconnect/retry when do_work_database_error is called;
use the existing Connectable, connect, and do_work_database_error symbols to
place the assertion near the current CacheError check.
- Around line 124-129: The test
`test_database_error_wraps_in_cache_error_no_retry` only verifies that
`do_work_database_error()` raises `CacheError`, but it does not prove the
reconnect path was skipped. After the initial `c.connect()` call, patch or mock
`Connectable.connect` and assert it is not called when
`do_work_database_error()` raises, so this test checks the “no retry” behavior
as well as the `CacheError` wrapping.

---

Nitpick comments:
In `@tests/unit/cache/test_postgres_cache.py`:
- Around line 836-865: The health-check tests are bypassing the real behavior by
mutating PostgresCache internals directly, so they won’t catch regressions in
_health_check_loop(). Update the tests to drive the loop logic through a
single-iteration helper or a controllable sleep/event on PostgresCache, then
assert the resulting _health_status and _consecutive_failures values after
simulated success/failure. Use the existing PostgresCache, _health_check_loop,
_mark_unhealthy, and ready() symbols to locate the behavior and keep the
assertions focused on observable state changes rather than direct field
assignment.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3804766b-967a-49bb-b307-520f135d2133

📥 Commits

Reviewing files that changed from the base of the PR and between b51ca02 and 9868f61.

📒 Files selected for processing (10)
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/app/models/models.py
  • ols/constants.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/app/models/test_config.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py

Comment thread ols/app/endpoints/health.py
Comment thread ols/app/models/config.py Outdated
Comment thread ols/app/models/config.py
Comment on lines +159 to +165
self._lock_timeout = config.lock_timeout
self.capacity = config.max_entries

self._health_status = True
self._consecutive_failures = 0
self._health_lock = threading.Lock()
self._health_check_interval = config.health_check_interval

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Validate timeout and interval values before the daemon uses them.

health_check_interval reaches time.sleep() before the try, so a negative value kills the health thread and 0 creates a hot polling loop. lock_timeout should also reject negative values unless indefinite blocking is intentional. The related PostgresConfig contract currently exposes these as plain int; validate them there with PositiveInt/non-negative types. As per coding guidelines, ols/**/*.py: Use Pydantic models for configuration and data validation.

Also applies to: 211-215, 254-257

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 159 - 165, Validate the
Postgres cache timing settings before they reach the daemon:
`health_check_interval` must not allow negative values or zero-hot-loop
behavior, and `lock_timeout` should reject negative values unless indefinite
blocking is explicitly intended. Add Pydantic validation in `PostgresConfig`
(using appropriate constrained types such as non-negative/positive ints) and
keep the runtime consumers in `PostgresCache` safe by relying on those validated
values in the initialization and health-check logic. Update any related uses in
`PostgresCache` and its config wiring so the daemon never calls `time.sleep()`
or lock acquisition with invalid values.

Source: Coding guidelines

Comment thread ols/src/cache/postgres_cache.py
Comment thread ols/utils/postgres.py
Comment on lines +33 to +45
except (psycopg2.OperationalError, psycopg2.InterfaceError) as e:
logger.warning(
"Connection error in %s, attempting reconnect: %s", f.__name__, e
)
if hasattr(connectable, "_mark_unhealthy"):
connectable._mark_unhealthy()
try:
connectable.connect()
except Exception as reconnect_err:
raise CacheError(
f"reconnect failed in {f.__name__}", reconnect_err
) from reconnect_err
return f(connectable, *args, **kwargs)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the relevant file and nearby symbols.
git ls-files | rg '^ols/utils/postgres\.py$|^ols/.+\.py$' >/dev/null
ast-grep outline ols/utils/postgres.py --view expanded || true

echo '--- file excerpt ---'
wc -l ols/utils/postgres.py
sed -n '1,220p' ols/utils/postgres.py

echo '--- usages ---'
rg -n "insert_or_append|_mark_unhealthy|connect\(|OperationalError|InterfaceError|CacheError" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 8349


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the cache implementation to understand commit/append semantics.
fd -a 'postgres.py' ols
fd -a 'cache.py' ols
fd -a '*cache*' ols | sed -n '1,120p'

echo '--- Postgres cache excerpt ---'
sed -n '1,260p' ols/utils/postgres.py

echo '--- cache-related excerpts ---'
rg -n "class .*Cache|def insert_or_append|append|commit|upsert|ON CONFLICT|executemany|execute" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 639


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- postgres_cache outline ---'
ast-grep outline ols/src/cache/postgres_cache.py --view expanded || true

echo '--- insert_or_append section ---'
sed -n '280,380p' ols/src/cache/postgres_cache.py

echo '--- decorators/usages in postgres_cache ---'
rg -n "^\\s*`@connection`|def (get|insert_or_append|delete|list|set_topic_summary)|commit\\(|ON CONFLICT|DO UPDATE|DO NOTHING|append" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 6822


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- cache interface excerpt ---'
sed -n '1,220p' ols/src/cache/cache.py

echo '--- in-memory implementation excerpt ---'
sed -n '1,180p' ols/src/cache/in_memory_cache.py

Repository: openshift/lightspeed-service

Length of output: 10989


🏁 Script executed:

#!/bin/bash
set -euo pipefail

python3 - <<'PY'
from pathlib import Path
path = Path('ols/src/cache/postgres_cache.py')
text = path.read_text()
for needle in ['`@connection`', 'def insert_or_append', 'def get', 'def delete', 'def list', 'def set_topic_summary']:
    idx = text.find(needle)
    if idx != -1:
        start = text.rfind('\n', 0, idx-120)
        end = text.find('\n\n', idx)
        print(f'### {needle}')
        print(text[max(0, idx-120):min(len(text), idx+1200)])
        print()
PY

Repository: openshift/lightspeed-service

Length of output: 8205


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -A80 -B20 "^\\s*`@connection`|^\\s*def insert_or_append|^\\s*def get|^\\s*def delete|^\\s*def list|^\\s*def set_topic_summary" ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 11834


🏁 Script executed:

#!/bin/bash
set -euo pipefail

sed -n '280,370p' ols/src/cache/postgres_cache.py

Repository: openshift/lightspeed-service

Length of output: 3988


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '280,370p'

Repository: openshift/lightspeed-service

Length of output: 4625


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 724


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '1,260p'

Repository: openshift/lightspeed-service

Length of output: 11615


🏁 Script executed:

#!/bin/bash
set -euo pipefail

cat -n ols/src/cache/postgres_cache.py | sed -n '285,360p'

Repository: openshift/lightspeed-service

Length of output: 3885


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'

Repository: openshift/lightspeed-service

Length of output: 245


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- PostgresBase subclasses ---'
rg -n "class .*\\(PostgresBase\\)" ols -g '*.py'

echo '--- connection decorator uses ---'
rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 869


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724


🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n "`@connection`" ols -g '*.py' || true

Repository: openshift/lightspeed-service

Length of output: 724


Limit @connection retries on write methods. PostgresCache.insert_or_append is retried after OperationalError/InterfaceError, but it appends to the stored history and increments message_count before commit(). If the connection drops after PostgreSQL has already committed, the retry can duplicate the entry and metadata update.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/utils/postgres.py` around lines 33 - 45, The `@connection` retry logic in
the postgres decorator is causing duplicate writes for methods like
PostgresCache.insert_or_append because the operation may have already committed
before reconnect and retry. Update the retry behavior in the connection wrapper
to avoid automatically retrying write/mutating methods after
OperationalError/InterfaceError, and keep the retry path only for safe read
operations or explicitly idempotent calls identified by the wrapped function
name.

Comment thread ols/utils/postgres.py Outdated
Comment thread tests/unit/cache/test_postgres_cache.py Outdated
Comment thread tests/unit/utils/test_postgres.py Outdated
Comment thread tests/unit/utils/test_postgres.py Outdated
@sriroopar sriroopar force-pushed the postgres-auto-recovery branch 2 times, most recently from eea3052 to 15f8455 Compare June 24, 2026 20:16
Implement background health-check loop, smarter error classification
in the @connection decorator, operation timeouts, and enhanced
liveness/readiness probes so OLS can recover automatically when the
backing PostgreSQL database is restarted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sriroopar sriroopar force-pushed the postgres-auto-recovery branch from 15f8455 to 6e37719 Compare June 24, 2026 20:32

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 350-353: The DatabaseError handling in PostgresCache should use
the same rollback protection as the operational-error path, because calling
connection.rollback() directly can hide the original psycopg2.DatabaseError and
prevent the CacheError wrapping from happening. Update the exception blocks in
PostgresCache.insert_or_append and the other matching DatabaseError path to call
_safe_rollback() instead of self.connection.rollback(), while keeping the
existing logger.error and CacheError("PostgresCache.insert_or_append", e)
behavior intact.
- Around line 184-188: The `_mark_unhealthy()` path only updates
`_health_status`, so repeated request-path failures from the `@connection` retry
flow do not advance `_consecutive_failures` toward the liveness threshold.
Update `_mark_unhealthy()` in `PostgresCache` to also increment the failure
counter under `_health_lock`, keeping the unhealthy flag and failure tracking in
sync so `is_healthy`/liveness checks can trip promptly after repeated
cache-operation DB failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c84be3c2-a66c-475a-91d3-5d332a00fb83

📥 Commits

Reviewing files that changed from the base of the PR and between 9868f61 and 6e37719.

📒 Files selected for processing (10)
  • docs/openapi.json
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/constants.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/app/models/test_config.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (8)
  • ols/app/endpoints/health.py
  • ols/constants.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/app/endpoints/test_health.py
  • ols/app/models/config.py
  • tests/unit/utils/test_postgres.py
  • ols/utils/postgres.py
  • tests/unit/app/models/test_config.py

Comment thread ols/src/cache/postgres_cache.py
Comment thread ols/src/cache/postgres_cache.py

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (2)
ols/app/endpoints/health.py (1)

134-144: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Encapsulate health-state access instead of reaching into PostgresCache internals.

The endpoint directly touches cache._health_lock and cache._consecutive_failures, private attributes of PostgresCache. This crosses the app/cache layer boundary and couples the endpoint to the cache's internal locking scheme.

Consider exposing a public method (e.g. PostgresCache.is_healthy(threshold) -> bool) that encapsulates the lock + comparison, so health.py only calls a stable public API.

As per coding guidelines, "Respect architectural boundaries: Do not cross module or layer boundaries, even when it is the shorter path."

♻️ Proposed refactor
-    cache = config._conversation_cache
-    if isinstance(cache, PostgresCache):
-        threshold = config.ols_config.liveness_db_failure_threshold
-        with cache._health_lock:
-            failures = cache._consecutive_failures
-        if failures >= threshold:
-            response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
-            return LivenessResponse(alive=False, reason="database unreachable")
+    cache = config._conversation_cache
+    if isinstance(cache, PostgresCache):
+        threshold = config.ols_config.liveness_db_failure_threshold
+        if not cache.is_healthy(threshold):
+            response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
+            return LivenessResponse(alive=False, reason="database unreachable")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/app/endpoints/health.py` around lines 134 - 144, The liveness probe in
liveness_probe_get_method is reaching into PostgresCache private state
(_health_lock and _consecutive_failures), so move that lock-and-compare logic
behind a public API on PostgresCache such as is_healthy(threshold) -> bool.
Update health.py to call only the new public method and keep the endpoint from
depending on cache internals.

Source: Coding guidelines

tests/unit/app/endpoints/test_health.py (1)

230-283: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Hoist repeated inline PostgresConfig import to module level.

from ols.app.models.config import PostgresConfig is repeated inline in three test functions (lines 233, 251, 271). Unless this is needed to avoid a circular import, move it to the top of the file.

As per coding guidelines, "module-level imports by default, inline imports only when needed for circular dependencies or deferred optional heavy dependencies."

♻️ Proposed refactor
+from ols.app.models.config import PostgresConfig
 from ols.src.cache.postgres_cache import PostgresCache

And remove the three inline from ols.app.models.config import PostgresConfig occurrences inside the test bodies.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/app/endpoints/test_health.py` around lines 230 - 283, The three
liveness probe tests repeat an inline PostgresConfig import inside each test
body; hoist that import to the module level in the test file and remove the
repeated local imports from
test_liveness_probe_returns_alive_when_postgres_healthy,
test_liveness_probe_returns_503_when_postgres_unhealthy, and
test_liveness_probe_returns_alive_when_below_threshold. Keep the tests using
PostgresConfig exactly as before, just reference the shared module-level import
to follow the import guideline unless a circular dependency requires otherwise.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@ols/app/endpoints/health.py`:
- Around line 134-144: The liveness probe in liveness_probe_get_method is
reaching into PostgresCache private state (_health_lock and
_consecutive_failures), so move that lock-and-compare logic behind a public API
on PostgresCache such as is_healthy(threshold) -> bool. Update health.py to call
only the new public method and keep the endpoint from depending on cache
internals.

In `@tests/unit/app/endpoints/test_health.py`:
- Around line 230-283: The three liveness probe tests repeat an inline
PostgresConfig import inside each test body; hoist that import to the module
level in the test file and remove the repeated local imports from
test_liveness_probe_returns_alive_when_postgres_healthy,
test_liveness_probe_returns_503_when_postgres_unhealthy, and
test_liveness_probe_returns_alive_when_below_threshold. Keep the tests using
PostgresConfig exactly as before, just reference the shared module-level import
to follow the import guideline unless a circular dependency requires otherwise.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 51571fc0-22a7-4977-978e-38af5cbf15fd

📥 Commits

Reviewing files that changed from the base of the PR and between 6e37719 and 7e46a50.

📒 Files selected for processing (8)
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/app/models/models.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • ols/app/models/config.py
  • ols/src/cache/postgres_cache.py
  • tests/unit/cache/test_postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/utils/test_postgres.py

- Align 503 liveness response with LivenessResponse schema by adding
  optional reason field to LivenessResponse model and using Response
  parameter to set status code
- Add NonNegativeInt/PositiveInt validation for PostgresConfig timeout
  fields (statement_timeout, lock_timeout, health_check_interval)
- Validate liveness_db_failure_threshold in OLSConfig with helper
  function to ensure positive integer value
- Add connect_timeout and statement_timeout to health check connection
- Increment _consecutive_failures in _mark_unhealthy for proper liveness
  threshold tracking
- Use _safe_rollback in DatabaseError exception paths
- Close cursor with try/finally in postgres.py for proper cleanup
- Add type annotations to test fixtures and functions
- Tighten no-retry assertion in test_postgres.py to verify reconnect
  is not called on DatabaseError

Co-authored-by: Cursor <cursoragent@cursor.com>
@sriroopar sriroopar force-pushed the postgres-auto-recovery branch from 7e46a50 to c81358c Compare July 2, 2026 13:22
@sriroopar

sriroopar commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Adversarial Code Review - PostgreSQL Auto-Recovery (OLS-3221)

Critical Issues

1. Race Condition: Failure Counter Reset vs Increment (postgres_cache.py:184-188, 226-230)

There's a race condition between _mark_unhealthy() (called from request threads via @connection decorator) and the health check loop resetting _consecutive_failures:

# Thread A (request): _mark_unhealthy()
self._consecutive_failures += 1  # increments to 3

# Thread B (health loop): success path
self._consecutive_failures = 0   # resets to 0

# Thread A continues but the failure was "lost"

Impact: Liveness check could oscillate between healthy/unhealthy unpredictably.

Suggestion: Consider using an atomic counter or ensuring the health loop only resets failures after a configurable number of successful probes.


2. Potential Data Loss: @connection Retries Write Operations (postgres.py:31-45)

The @connection decorator retries after OperationalError/InterfaceError, but insert_or_append could have partially committed before the connection error:

# In insert_or_append: connection commits successfully
self.connection.commit()  # <-- succeeds
# Then network drops here, raising OperationalError
# Decorator catches and retries entire operation
# Result: duplicate entry appended

Impact: Message duplication in conversation history after network glitches.

Suggestion: Add idempotency tracking or disable retry for write operations. The advisory lock helps within a transaction but doesn't prevent retry-induced duplicates.


3. Liveness Endpoint Accesses Private Attributes (health.py:139-140)

with cache._health_lock:
    failures = cache._consecutive_failures

Accessing _health_lock and _consecutive_failures directly violates encapsulation and creates tight coupling.

Suggestion: Add a public method like get_consecutive_failures() or is_healthy_for_liveness(threshold: int) to PostgresCache.


Medium Issues

4. Health Check Thread Has No Graceful Shutdown (postgres_cache.py:217-244)

def _health_check_loop(self) -> None:
    while True:  # No exit condition
        time.sleep(self._health_check_interval)

The daemon thread runs forever with no way to stop it cleanly. While daemon=True means it won't block process exit, this prevents clean shutdown in tests and can cause issues with connection cleanup.

Suggestion: Add a _shutdown_event = threading.Event() and check it in the loop.


5. Health Check Doesn't Run Immediately (postgres_cache.py:219-220)

def _health_check_loop(self) -> None:
    while True:
        time.sleep(self._health_check_interval)  # Sleeps FIRST
        # ... then checks

Initial health status is True but the first actual check happens after health_check_interval seconds (default 30s). If the DB is unreachable at startup, ready() returns True for 30 seconds.

Suggestion: Perform initial health check before sleeping, or set initial _health_status = False until first successful check.


6. Lock Timeout of 0 Causes Immediate Failure (postgres_cache.py:260-264)

def _acquire_lock(self) -> None:
    if not self._tx_lock.acquire(timeout=self._lock_timeout):
        raise CacheError("lock acquisition timeout")

If lock_timeout is configured as 0, every cache operation will fail immediately. The NonNegativeInt validation allows 0.

Suggestion: Either use PositiveInt for lock_timeout in config, or handle 0 specially (e.g., use blocking=True for infinite wait).


7. Health Connection State Inconsistency (postgres_cache.py:190-210)

def _connect_health(self) -> None:
    if self._health_connection is not None:
        try:
            self._health_connection.close()
        except Exception:
            pass  # Old connection closed or failed
    # ... if connect() fails here, _health_connection is still the old (closed) object
    self._health_connection = psycopg2.connect(**connect_kwargs)

If psycopg2.connect() fails, _health_connection still references the old closed connection.

Suggestion: Set self._health_connection = None before attempting new connection.


Minor Issues

8. Hardcoded connect_timeout: 10 (postgres_cache.py:207)

The health check connection timeout is hardcoded to 10 seconds. This should be configurable via PostgresConfig.


9. No Test for Concurrent Access Under Lock Contention

The test suite doesn't verify behavior when multiple threads compete for _tx_lock with timeout enabled. Consider adding a stress test.


Summary

The auto-recovery implementation is a significant improvement over the previous state, but the thread safety and retry semantics need attention before production deployment. The most critical issues are:

  1. Race condition in failure counting
  2. Potential duplicate writes on retry
  3. Breaking encapsulation in liveness endpoint

@sriroopar

Copy link
Copy Markdown
Contributor Author

Adversarial Review - Updated Code (Follow-up)

Issues Addressed ✅

Great progress! The following issues from my previous review have been addressed:

  1. Encapsulation - Now using cache.consecutive_failures property instead of private attributes
  2. Graceful shutdown - Added shutdown() method and _shutdown_event
  3. Configurable connect_timeout - Added health_check_connect_timeout to PostgresConfig
  4. Health check runs immediately - Loop now checks first, then waits at end
  5. Initial connect in try block - Moved inside try in @connection decorator

Remaining Issues

1. Health Connection State Inconsistency (postgres_cache.py:191-211)

Still not addressed:

def _connect_health(self) -> None:
    if self._health_connection is not None:
        try:
            self._health_connection.close()
        except Exception:
            pass
    # If psycopg2.connect() fails here, _health_connection still references
    # the old (closed) connection object
    self._health_connection = psycopg2.connect(**connect_kwargs)

Fix: Set self._health_connection = None after closing and before attempting new connection:

if self._health_connection is not None:
    try:
        self._health_connection.close()
    except Exception as close_err:
        logger.debug("Failed to close old health connection: %s", close_err)
    self._health_connection = None  # <-- Add this

2. shutdown() Doesn't Clean Up Resources (postgres_cache.py:218-220)

def shutdown(self) -> None:
    """Signal the health-check thread to stop."""
    self._shutdown_event.set()

The method only signals shutdown but doesn't:

  • Wait for the thread to finish (join())
  • Close _health_connection
  • Close the main connection

Suggestion:

def shutdown(self, timeout: float = 5.0) -> None:
    """Signal the health-check thread to stop and clean up."""
    self._shutdown_event.set()
    self._health_thread.join(timeout=timeout)
    if self._health_connection:
        try:
            self._health_connection.close()
        except Exception:
            pass

3. Lock Timeout of 0 Still Causes Immediate Failure (postgres_cache.py:265-269)

lock_timeout uses NonNegativeInt which allows 0. If configured as 0, every cache operation fails immediately.

Suggestion: Either:

  • Use PositiveInt for lock_timeout in config, OR
  • Handle 0 specially to mean "infinite wait":
def _acquire_lock(self) -> None:
    timeout = self._lock_timeout if self._lock_timeout > 0 else None
    if not self._tx_lock.acquire(timeout=timeout):
        raise CacheError("lock acquisition timeout")

4. Startup Race: First Health Check May Increment Failures (postgres_cache.py:222-249)

On first loop iteration, _health_connection is None, triggering _connect_health(). If DB is unreachable at startup, this increments _consecutive_failures immediately, potentially triggering liveness failure before any real operations have been attempted.

This may be acceptable behavior, but worth documenting.


Minor Observations

5. Potential Duplicate Writes on Retry (Unchanged)

This is documented as a known trade-off. The advisory lock helps within a transaction, but the @connection retry could still cause duplicates if the commit succeeds but connection drops immediately after. Consider adding a note in the docstring.


Summary

The code is much improved! The main remaining issues are:

  1. Health connection state - should be set to None before reconnect attempt
  2. shutdown() cleanup - should join thread and close connections
  3. Lock timeout edge case - 0 value handling

None of these are blockers, but fixing #1 and #2 would improve robustness.

- Fix encapsulation: add public consecutive_failures property, remove
  direct access to private _health_lock/_consecutive_failures from
  liveness endpoint
- Fix initial connection bypass: move connect() inside try/except in
  @connection decorator so failures get _mark_unhealthy() and CacheError
- Make health check connect timeout configurable via PostgresConfig
- Add graceful shutdown: shutdown() joins thread and closes connection
- Fix delayed initial check: run health probe before sleep, use
  Event.wait() instead of time.sleep()
- Fix startup race: skip failure counter on first health check to avoid
  premature liveness failures
- Fix stale connection reference: set _health_connection = None after
  close before reconnect
- Reject zero lock_timeout: change from NonNegativeInt to PositiveInt

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ols/src/cache/postgres_cache.py`:
- Around line 219-230: The Postgres cache shutdown path is implemented in
PostgresCache.shutdown() but is never invoked by the app lifecycle. Update the
app owner in main application startup/shutdown flow to call
config.conversation_cache.shutdown() during process teardown so the health
thread is joined and the connection is closed. Use the existing
CacheFactory-created cache instance and wire the call into the app
shutdown/lifespan handling in ols/app/main.py.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b907267e-0034-4eef-b88d-2b21f0a353dc

📥 Commits

Reviewing files that changed from the base of the PR and between 7e46a50 and f7d7700.

📒 Files selected for processing (9)
  • ols/app/endpoints/health.py
  • ols/app/models/config.py
  • ols/app/models/models.py
  • ols/constants.py
  • ols/src/cache/postgres_cache.py
  • ols/utils/postgres.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/cache/test_postgres_cache.py
  • tests/unit/utils/test_postgres.py
✅ Files skipped from review due to trivial changes (1)
  • ols/constants.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • ols/app/models/models.py
  • tests/unit/app/endpoints/test_health.py
  • ols/app/endpoints/health.py
  • ols/utils/postgres.py
  • tests/unit/cache/test_postgres_cache.py
  • ols/app/models/config.py
  • tests/unit/utils/test_postgres.py

Comment on lines +219 to +230
def shutdown(self) -> None:
"""Stop the health-check thread and close its connection."""
self._shutdown_event.set()
self._health_thread.join(timeout=self._health_check_connect_timeout)
if self._health_connection is not None:
try:
self._health_connection.close()
except Exception as close_err:
logger.debug(
"Failed to close health connection on shutdown: %s", close_err
)
self._health_connection = None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find any callers of PostgresCache.shutdown() outside its definition/tests.
rg -nP '\.shutdown\s*\(' --type=py -g '!**/test*/**' -C3
# Also check lifespan/atexit hooks that might wire cache cleanup.
rg -nP '(lifespan|atexit|on_event\(.?shutdown|register\()' --type=py -C2

Repository: openshift/lightspeed-service

Length of output: 166


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate the cache class and its usages.
git ls-files | rg '^ols/src/cache/postgres_cache\.py$|^ols/src/.*\.py$|^.*(main|app|lifespan|startup|shutdown|atexit).*\.(py|ts|js)$' || true

printf '\n## PostgresCache definition/usages\n'
rg -n "class PostgresCache|PostgresCache\(" ols/src -g '*.py' -C 2 || true

printf '\n## shutdown() callers\n'
rg -n "\.shutdown\s*\(" ols/src -g '*.py' -C 2 || true

printf '\n## lifecycle hooks / cleanup wiring\n'
rg -n "(lifespan|atexit|on_event\(.?shutdown|register\(|shutdown_event|startup_event|app\.on_event|FastAPI\()" ols/src -g '*.py' -C 2 || true

Repository: openshift/lightspeed-service

Length of output: 6600


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the cache module and nearby app/bootstrap files for cleanup wiring.
ast-grep outline ols/src/cache/postgres_cache.py || true
printf '\n## postgres_cache.py excerpt\n'
sed -n '1,280p' ols/src/cache/postgres_cache.py | cat -n

printf '\n## repo files mentioning cache module\n'
rg -n "postgres_cache|PostgresCache|cache" ols/src -g '*.py' -C 2 || true

Repository: openshift/lightspeed-service

Length of output: 50386


🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Find application bootstrap/lifespan files in the repo.
fd -a -t f -e py 'app|main|lifespan|startup|shutdown|fastapi|asgi' ols || true

Repository: openshift/lightspeed-service

Length of output: 414


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '## app/main.py outline\n'
ast-grep outline ols/app/main.py || true

printf '\n## app/main.py relevant lines\n'
rg -n "FastAPI|lifespan|startup|shutdown|conversation_cache|CacheFactory|Cache|router|include_router|on_event|atexit" ols/app/main.py -C 3 || true

printf '\n## config/cache ownership\n'
rg -n "conversation_cache|CacheFactory|cache_factory|PostgresCache|InMemoryCache|cache:" ols/app/models/config.py ols/src/cache/cache_factory.py ols/app -g '*.py' -C 3 || true

Repository: openshift/lightspeed-service

Length of output: 15578


🏁 Script executed:

#!/bin/bash
set -euo pipefail

printf '## app/models/config.py outline\n'
ast-grep outline ols/app/models/config.py || true

printf '\n## cache-related definitions in config.py\n'
rg -n "conversation_cache|CacheFactory|cache|shutdown|lifecycle|lifespan" ols/app/models/config.py -C 4 || true

printf '\n## cache factory file\n'
ast-grep outline ols/src/cache/cache_factory.py || true
sed -n '1,220p' ols/src/cache/cache_factory.py | cat -n

Repository: openshift/lightspeed-service

Length of output: 13311


Wire PostgresCache.shutdown() into app shutdown

ols/app/main.py has no lifespan or shutdown hook, and CacheFactory only constructs the cache. Call config.conversation_cache.shutdown() from the app owner so the health thread and connection are closed before process exit.

🧰 Tools
🪛 Ruff (0.15.20)

[warning] 226-226: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@ols/src/cache/postgres_cache.py` around lines 219 - 230, The Postgres cache
shutdown path is implemented in PostgresCache.shutdown() but is never invoked by
the app lifecycle. Update the app owner in main application startup/shutdown
flow to call config.conversation_cache.shutdown() during process teardown so the
health thread is joined and the connection is closed. Use the existing
CacheFactory-created cache instance and wire the call into the app
shutdown/lifespan handling in ols/app/main.py.

- Update test_liveness assertion to include reason: null field
  added to LivenessResponse model
- Regenerate docs/openapi.json to reflect config and model changes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Copy link
Copy Markdown

@sriroopar: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-ols-cluster cdb26ca link true /test e2e-ols-cluster

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant