LCORE-1859: Enhance /readiness endpoint with degraded mode reporting by anik120 · Pull Request #1947 · lightspeed-core/lightspeed-stack

anik120 · 2026-06-18T14:56:11Z

Description

#1781 introduced "degraded mode support": the ability to start lightspeed-stack and keep it running even when the llama-stack server is unavailable.

This PR adds comprehensive degraded mode status reporting to the /readiness endpoint while maintaining clean API boundaries and Kubernetes probe semantics.

- Enhanced HealthStatus enum with DEGRADED and UNHEALTHY service-level statuses while preserving provider-level statuses (OK, ERROR, NOT_IMPLEMENTED, UNKNOWN)
- Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode following Kubernetes semantics; only returns 503 when truly unhealthy
- Refactored to avoid leaking implementation details in API responses:
  - Removed llama_stack field from ReadinessResponse
  - Removed Llama Stack version tracking from DegradedModeTracker
  - Focus on functional impacts rather than internal technology stack

This design keeps internal implementation details (Llama Stack) private while exposing clear functional impacts to API consumers.
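
As a rough illustration of the probe semantics described above, the mapping from service-level status to HTTP status code could look like the sketch below. The `ServiceStatus` enum and the `readiness_http_status` helper are hypothetical names used only for this example; they are not the identifiers from this PR, and only the three service-level statuses are modeled.

```python
from enum import Enum
from http import HTTPStatus


class ServiceStatus(str, Enum):
    """Hypothetical stand-in for the service-level members of HealthStatus."""

    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def readiness_http_status(status: ServiceStatus) -> HTTPStatus:
    """Return 503 only when the service is unhealthy, otherwise 200."""
    if status is ServiceStatus.UNHEALTHY:
        return HTTPStatus.SERVICE_UNAVAILABLE
    # Healthy and degraded both count as "ready" from the probe's point of view.
    return HTTPStatus.OK


for status in ServiceStatus:
    print(status.value, int(readiness_http_status(status)))
```

Treating degraded as ready keeps the pod in rotation for the requests it can still serve, which is the behavior the description attributes to Kubernetes probe semantics.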

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Summary by CodeRabbit

New Features
- Service now tracks and reports degraded mode operation with impact messaging when dependencies are unavailable.
- Enhanced readiness probe returns overall health status (healthy, degraded, or unhealthy) and describes functional limitations.
- Improved provider health and default model availability checks with clearer error reporting.
Documentation
- Updated API documentation to reflect new health status fields and degraded mode behavior.

coderabbitai · 2026-06-18T15:01:33Z

Walkthrough

Introduces a DegradedModeTracker singleton that tracks whether Lightspeed Core Stack is running without a reachable Llama Stack backend. Expands HealthStatus with DEGRADED and UNHEALTHY members, adds overall_status and impacts fields to ReadinessResponse, wires tracker updates into app startup, and branches the /readiness endpoint on degraded state. OpenAPI spec and tests are updated accordingly.

Changes

Degraded Mode Tracking and Readiness Enrichment

| Layer / File(s) | Summary |
| --- | --- |
| HealthStatus enum expansion and ReadinessResponse schema: `src/models/common/health.py`, `src/models/common/__init__.py`, `src/models/api/responses/successful/probes.py` | `HealthStatus` gains `DEGRADED` and `UNHEALTHY` enum members with updated docstring. `ReadinessResponse` gains `overall_status: HealthStatus` and `impacts: Optional[list[str]]` fields, updated field metadata, and two OpenAPI schema examples. |
| DegradedModeTracker singleton: `src/utils/degraded_mode.py`, `tests/unit/test_degraded_mode.py` | New singleton class with `_is_degraded` and `_degraded_reason` state; exposes `set_degraded`, `set_healthy`, `is_degraded`, and `get_degraded_reason`. Tests cover initial state, transitions, and singleton identity. |
| App startup degraded-mode wiring: `src/app/main.py` | Imports `DegradedModeTracker` and calls `set_degraded` when `check_llama_stack_version` returns `None` or raises `APIConnectionError`, and `set_healthy` on success. |
| /readiness handler branching and richer responses: `src/app/endpoints/health.py` | Handler checks `DegradedModeTracker.is_degraded()` first, returning HTTP 200 with `overall_status=DEGRADED` early. Otherwise checks provider health (distinguishing connection errors from provider errors) and default model availability, returning `overall_status=UNHEALTHY` with `impacts` on failures or `overall_status=HEALTHY` on success. |
| Readiness endpoint and response model tests: `tests/unit/app/endpoints/test_health.py`, `tests/unit/models/responses/test_successful_responses.py` | Existing readiness tests gain `DegradedModeTracker` mocks and assertions on `overall_status` and `impacts`. New `TestReadinessDegradedMode` tests the degraded-mode path. `ReadinessResponse` unit tests pass `overall_status` on construction and assert 2 OpenAPI examples. |
| OpenAPI spec: `docs/openapi.json` | Adds `HealthStatus` enum schema, expands `ReadinessResponse` with `overall_status` and `impacts`, updates `required` list and examples, and rewrites the readiness probe operation description to document degraded-mode behavior. |
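
To make the tracker row above more concrete, here is a minimal sketch of a singleton with the state and method names listed in the summary. It is an assumption about the general shape only: the real class in `src/utils/degraded_mode.py` may enforce the singleton differently, add locking, or use different signatures.

```python
from typing import Optional


class DegradedModeTracker:
    """Illustrative process-wide tracker; not the code from this PR."""

    _instance: Optional["DegradedModeTracker"] = None
    _is_degraded: bool
    _degraded_reason: Optional[str]

    def __new__(cls) -> "DegradedModeTracker":
        # Create the single instance lazily and initialise its state once.
        if cls._instance is None:
            instance = super().__new__(cls)
            instance._is_degraded = False
            instance._degraded_reason = None
            cls._instance = instance
        return cls._instance

    def set_degraded(self, reason: str) -> None:
        """Record that the service is running in degraded mode."""
        self._is_degraded = True
        self._degraded_reason = reason

    def set_healthy(self) -> None:
        """Clear degraded state."""
        self._is_degraded = False
        self._degraded_reason = None

    def is_degraded(self) -> bool:
        """Return True while degraded mode is active."""
        return self._is_degraded

    def get_degraded_reason(self) -> Optional[str]:
        """Return the recorded reason, or None when healthy."""
        return self._degraded_reason


tracker = DegradedModeTracker()
tracker.set_degraded("backend unreachable at startup")  # example reason text
print(DegradedModeTracker().is_degraded())
tracker.set_healthy()
```

Because every call to `DegradedModeTracker()` returns the same object, state set at startup is visible to the readiness handler, and for the same reason it persists across tests unless it is reset.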

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ReadinessEndpoint as GET /readiness
  participant DegradedModeTracker
  participant ProviderHealth as get_providers_health_statuses
  participant ModelCheck as check_default_model_available

  Client->>ReadinessEndpoint: GET /readiness
  ReadinessEndpoint->>DegradedModeTracker: is_degraded()
  alt degraded
    DegradedModeTracker-->>ReadinessEndpoint: True + reason
    ReadinessEndpoint-->>Client: 200 overall_status=DEGRADED, ready=True, impacts=[...]
  else healthy
    DegradedModeTracker-->>ReadinessEndpoint: False
    ReadinessEndpoint->>ProviderHealth: get_providers_health_statuses()
    ProviderHealth-->>ReadinessEndpoint: provider list
    alt unhealthy providers
      ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
    else providers OK
      ReadinessEndpoint->>ModelCheck: check_default_model_available()
      ModelCheck-->>ReadinessEndpoint: available or not
      alt model unavailable
        ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
      else model available
        ReadinessEndpoint-->>Client: 200 overall_status=HEALTHY, ready=True, impacts=null
      end
    end
  end
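
The branching in the diagram above can be sketched as a pure function. This is a simplified model, not the handler from `src/app/endpoints/health.py`: `evaluate_readiness` and its parameters are hypothetical, the checks are passed in as callables so the short-circuit order is visible, and reason strings other than "All providers are healthy" are placeholder wording.

```python
from typing import Callable, Optional


def evaluate_readiness(
    degraded_reason: Optional[str],
    get_unhealthy_providers: Callable[[], list[str]],
    is_default_model_available: Callable[[], bool],
) -> tuple[int, dict]:
    """Return (HTTP status, response body) following the diagram's branches."""
    # Degraded mode short-circuits the other checks and still reports ready.
    if degraded_reason is not None:
        return 200, {
            "ready": True,
            "reason": degraded_reason,
            "overall_status": "degraded",
            "impacts": ["LLM inference unavailable"],
        }

    # Any unhealthy provider makes the whole service unhealthy.
    unhealthy = get_unhealthy_providers()
    if unhealthy:
        return 503, {
            "ready": False,
            "reason": "Unhealthy providers: " + ", ".join(unhealthy),
            "overall_status": "unhealthy",
            "impacts": ["LLM inference unavailable"],
        }

    # The model check is only reached when all providers are OK.
    if not is_default_model_available():
        return 503, {
            "ready": False,
            "reason": "Default model is not available",
            "overall_status": "unhealthy",
            "impacts": ["LLM inference unavailable"],
        }

    return 200, {
        "ready": True,
        "reason": "All providers are healthy",
        "overall_status": "healthy",
        "impacts": None,
    }


status, body = evaluate_readiness(None, lambda: [], lambda: True)
print(status, body["overall_status"])
```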

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

lightspeed-core/lightspeed-stack#1781: Implements the same degraded-mode startup wiring in src/app/main.py around check_llama_stack_version and APIConnectionError handling that this PR builds upon.

Suggested reviewers

tisnik

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: enhancing the /readiness endpoint with degraded mode reporting capability, which aligns with the comprehensive changes across multiple files. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

anik120 · 2026-06-18T15:20:43Z

Here's what the new response looks like:

$ curl http://localhost:8080/readiness | jq
{
  "ready": true,
  "reason": "All providers are healthy",
  "overall_status": "healthy",
  "impacts": null,
  "providers": []
}

When llama-stack is down:

$ curl http://localhost:8080/readiness | jq
{
  "ready": false,
  "reason": "Cannot connect to backend service",
  "overall_status": "unhealthy",
  "impacts": [
    "LLM inference unavailable",
    "Provider health checks unavailable"
  ],
  "providers": []
}
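
For readers who want to reproduce the response shape without running the service, the two payloads above can be modeled with a plain dataclass. This is only a sketch: the real `ReadinessResponse` is a Pydantic model in `src/models/api/responses/successful/probes.py`, `ReadinessSketch` is a hypothetical name, and the element type of `providers` is assumed since both samples show an empty list.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class ReadinessSketch:
    """Stdlib stand-in mirroring the fields shown in the curl output."""

    ready: bool
    reason: str
    overall_status: str
    impacts: Optional[list[str]] = None
    providers: list[dict] = field(default_factory=list)  # element type assumed


healthy = ReadinessSketch(
    ready=True,
    reason="All providers are healthy",
    overall_status="healthy",
)
unhealthy = ReadinessSketch(
    ready=False,
    reason="Cannot connect to backend service",
    overall_status="unhealthy",
    impacts=["LLM inference unavailable", "Provider health checks unavailable"],
)

# Serialise both examples in the same field order as the samples above.
for response in (healthy, unhealthy):
    print(json.dumps(asdict(response), indent=2))
```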

lightspeed-core#1781 introduced "degraded mode support" - the ability to start lightspeed-stack and keep it running even when llama-stack server might not be available.

This PR adds comprehensive degraded mode status reporting to the /readiness endpoint while maintaining clean API boundaries and Kubernetes probe semantics.

- Enhanced HealthStatus enum with DEGRADED and UNHEALTHY service-level statuses while preserving provider-level statuses (OK, ERROR, NOT_IMPLEMENTED, UNKNOWN)
- Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode following Kubernetes semantics; only returns 503 when truly unhealthy
- Refactored to avoid leaking implementation details in API responses:
  - Removed llama_stack field from ReadinessResponse
  - Removed Llama Stack version tracking from DegradedModeTracker
  - Focus on functional impacts rather than internal technology stack

This design keeps internal implementation details (Llama Stack) private while exposing clear functional impacts to API consumers.

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>

coderabbitai

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

docs/openapi.json (1)
9964-17967: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Regenerate the OpenAPI schema to fix CI pipeline failure.

The CI pipeline is failing because docs/openapi.json is out of sync with the generated schema. This file should not be manually edited but regenerated from the source Pydantic models.

Run the following command to regenerate:
uv run scripts/generate_openapi_schema.py docs/openapi.json
This will ensure consistency between the Python model definitions and the OpenAPI specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 9964 - 17967, The OpenAPI JSON is stale and
no longer matches the generated schema, causing CI to fail. Regenerate
docs/openapi.json from the source models instead of editing it manually by
running scripts/generate_openapi_schema.py, and verify the updated output is
consistent with the Pydantic schemas that define endpoints like
readiness_probe_get_method_readiness_get,
liveness_probe_get_method_liveness_get, and
authorized_endpoint_handler_authorized_post.
Source: Pipeline failures
tests/unit/test_degraded_mode.py (1)
9-60: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent test order dependence from shared singleton state.

These tests mutate a process-wide singleton without resetting it between tests, so they can become flaky when run with different ordering or alongside other tests that touch DegradedModeTracker.
Suggested patch
+from collections.abc import Iterator
+
+import pytest
 from utils.degraded_mode import DegradedModeTracker
 
 
+@pytest.fixture(autouse=True)
+def reset_degraded_mode_tracker() -> Iterator[None]:
+    """Reset the shared singleton before and after each test."""
+    tracker = DegradedModeTracker()
+    tracker.set_healthy()
+    yield
+    tracker.set_healthy()
+
+
 class TestDegradedModeTracker:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_degraded_mode.py` around lines 9 - 60, The tests are mutating
the shared DegradedModeTracker singleton state without resetting it between
tests, which causes test order dependence and flakiness. Add a pytest fixture
with autouse=True that resets the DegradedModeTracker singleton to its initial
healthy state (either by calling set_healthy() or by clearing the internal
state) after each test completes. Alternatively, add a setUp or tearDown method
to the test class that resets the singleton state before or after each test
method.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/openapi.json`:
- Around line 13489-13502: The HealthStatus enum in src/models/common/health.py
has inconsistent casing where the "Error" value is capitalized while all other
enum values like "ok", "not_implemented", "unknown", "healthy", "degraded", and
"unhealthy" are lowercase. Either change the "Error" value to "error" to match
the lowercase convention used throughout the enum for consistency, or if this
capitalization is required for backwards compatibility with Llama Stack provider
responses, add explicit documentation to the HealthStatus enum docstring
explaining this dependency and why it differs from the other values.

In `@src/app/endpoints/health.py`:
- Line 151: Change the logger.info() call that logs "Response to /readiness
endpoint" to logger.debug() instead, since the readiness probe is called
frequently by orchestrators and logging at info level creates excessive noise in
the logs. Per coding guidelines, debug() should be used for diagnostic
information while info() should be reserved for significant program execution
events.
- Around line 159-163: The impact message strings ("LLM inference unavailable",
"RAG functionality unavailable", "Agent tools unavailable") are duplicated
across multiple locations in the health.py file (at lines 159-163, 186-189, and
214). Extract these string literals to shared constants in constants.py
(following the coding guideline to check constants.py before defining new ones),
then replace all hardcoded occurrences in the health endpoints with references
to these constants to ensure consistency and prevent message drift between
degraded and unhealthy responses.

In `@src/app/main.py`:
- Around line 91-95: When llama_stack_version is None, the code currently only
sets degraded mode if allow_degraded_mode is True, but fails to handle the case
where allow_degraded_mode is False. Add an else clause after the
degraded_tracker.set_degraded call to fail the startup process (using sys.exit
or raising an exception) when allow_degraded_mode is False, ensuring that
startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.

In `@src/utils/degraded_mode.py`:
- Around line 12-53: The class DegradedModeTracker and its methods need to
follow Google Python docstring conventions more strictly. Add an Attributes
section to the class docstring that documents the _is_degraded and
_degraded_reason instance variables with their types and descriptions. For each
method (set_degraded, set_healthy, is_degraded, get_degraded_reason), ensure the
docstrings include complete Google-style sections: Parameters (if applicable),
Returns, and Raises (if applicable). The set_degraded method should include a
Parameters section, while is_degraded and get_degraded_reason already have
Returns sections but may need Raises if exceptions can be thrown.

In `@tests/unit/app/endpoints/test_health.py`:
- Around line 382-396: The test for readiness_probe_get_method in degraded mode
is checking response payload fields but not verifying the HTTP status code
contract. Replace the mocker.Mock() object with a real Response object, and add
an assertion to verify that the response status code is 200 to ensure the probe
correctly returns HTTP 200 even in degraded mode.
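
The `src/app/main.py` comment above asks for startup to fail when the connectivity check returns nothing and degraded mode is disabled. A minimal sketch of that decision follows. The names `llama_stack_version` and `allow_degraded_mode` come from the review comment, while `resolve_startup_state`, its return values, and the choice of `RuntimeError` (the comment also mentions `sys.exit`) are assumptions made for illustration.

```python
from typing import Optional


def resolve_startup_state(
    llama_stack_version: Optional[str], allow_degraded_mode: bool
) -> str:
    """Decide how startup proceeds after the connectivity check."""
    if llama_stack_version is not None:
        # Connectivity check succeeded: start normally.
        return "healthy"
    if allow_degraded_mode:
        # Check failed but degraded mode is permitted: keep running.
        return "degraded"
    # Check failed and degraded mode is disabled: refuse to start.
    raise RuntimeError("Connectivity check failed and degraded mode is disabled")


print(resolve_startup_state("1.2.3", allow_degraded_mode=False))  # example version string
print(resolve_startup_state(None, allow_degraded_mode=True))
```

In the application itself, the two non-failing outcomes would correspond to calling `set_healthy` and `set_degraded` on the tracker.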

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b244f354-9638-4b87-ab69-29115287d11b

📥 Commits

Reviewing files that changed from the base of the PR and between 1f7d8ac and 0a3e946.

📒 Files selected for processing (10)

docs/openapi.json
src/app/endpoints/health.py
src/app/main.py
src/models/api/responses/successful/probes.py
src/models/common/__init__.py
src/models/common/health.py
src/utils/degraded_mode.py
tests/unit/app/endpoints/test_health.py
tests/unit/models/responses/test_successful_responses.py
tests/unit/test_degraded_mode.py

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)

GitHub Check: build-pr
GitHub Check: integration_tests (3.12)
GitHub Check: integration_tests (3.13)
GitHub Check: E2E: library mode / ci / group 2
GitHub Check: E2E: server mode / ci / group 2
GitHub Check: E2E: library mode / ci / group 3
GitHub Check: E2E: server mode / ci / group 1
GitHub Check: E2E: library mode / ci / group 1
GitHub Check: E2E: server mode / ci / group 3
GitHub Check: E2E Tests for Lightspeed Evaluation job
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request

🧰 Additional context used

📓 Path-based instructions (5)

src/**/*.py