Skip to content

LCORE-1859: Enhance /readiness endpoint with degraded mode reporting#1947

Open
anik120 wants to merge 1 commit into
lightspeed-core:mainfrom
anik120:degrade-mode-report-health
Open

LCORE-1859: Enhance /readiness endpoint with degraded mode reporting#1947
anik120 wants to merge 1 commit into
lightspeed-core:mainfrom
anik120:degrade-mode-report-health

Conversation

@anik120

@anik120 anik120 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Description

#1781 introduced "degraded mode support" - the ability to start lightspeed-stack and keep it running even when llama-stack server might not be available.

This PR adds comprehensive degraded mode status reporting to the /readiness endpoint while maintaining clean API boundaries and Kubernetes probe semantics.

  • Enhanced HealthStatus enum with DEGRADED and UNHEALTHY service-level statuses while preserving provider-level statuses (OK, ERROR, NOT_IMPLEMENTED, UNKNOWN)

  • Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode following Kubernetes semantics; only returns 503 when truly unhealthy

  • Refactored to avoid leaking implementation details in API responses:

    • Removed llama_stack field from ReadinessResponse
    • Removed Llama Stack version tracking from DegradedModeTracker
    • Focus on functional impacts rather than internal technology stack

This design keeps internal implementation details (Llama Stack) private while exposing clear functional impacts to API consumers.

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

Identify any AI code assistants used in this PR (for transparency and review context)

  • Assisted-by: (e.g., Claude, CodeRabbit, Ollama, etc., N/A if not used)
  • Generated by: (e.g., tool name and version; N/A if not used)

Related Tickets & Documents

  • Related Issue #
  • Closes #

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  • Please provide detailed steps to perform tests related to this code change.
  • How were the fix/results from this change verified? Please provide relevant screenshots or results.

Summary by CodeRabbit

  • New Features

    • Service now tracks and reports degraded mode operation with impact messaging when dependencies are unavailable.
    • Enhanced readiness probe returns overall health status (healthy, degraded, or unhealthy) and describes functional limitations.
    • Improved provider health and default model availability checks with clearer error reporting.
  • Documentation

    • Updated API documentation to reflect new health status fields and degraded mode behavior.

@anik120 anik120 force-pushed the degrade-mode-report-health branch from 9bd1ce3 to 5dae3bf Compare June 18, 2026 14:58
@coderabbitai

coderabbitai Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

Introduces a DegradedModeTracker singleton to track Lightspeed Core Stack connectivity state. Expands HealthStatus with DEGRADED and UNHEALTHY members, adds overall_status and impacts fields to ReadinessResponse, wires tracker updates into app startup, and branches the /readiness endpoint on degraded state. OpenAPI spec and tests are updated accordingly.

Changes

Degraded Mode Tracking and Readiness Enrichment

Layer / File(s) Summary
HealthStatus enum expansion and ReadinessResponse schema
src/models/common/health.py, src/models/common/__init__.py, src/models/api/responses/successful/probes.py
HealthStatus gains DEGRADED and UNHEALTHY enum members with updated docstring. ReadinessResponse gains overall_status: HealthStatus and impacts: Optional[list[str]] fields, updated field metadata, and two OpenAPI schema examples.
DegradedModeTracker singleton
src/utils/degraded_mode.py, tests/unit/test_degraded_mode.py
New singleton class with _is_degraded and _degraded_reason state; exposes set_degraded, set_healthy, is_degraded, and get_degraded_reason. Tests cover initial state, transitions, and singleton identity.
App startup degraded-mode wiring
src/app/main.py
Imports DegradedModeTracker and calls set_degraded when check_llama_stack_version returns None or raises APIConnectionError, and set_healthy on success.
/readiness handler branching and richer responses
src/app/endpoints/health.py
Handler checks DegradedModeTracker.is_degraded() first, returning HTTP 200 with overall_status=DEGRADED early. Otherwise checks provider health (distinguishing connection errors from provider errors) and default model availability, returning overall_status=UNHEALTHY with impacts on failures or overall_status=HEALTHY on success.
Readiness endpoint and response model tests
tests/unit/app/endpoints/test_health.py, tests/unit/models/responses/test_successful_responses.py
Existing readiness tests gain DegradedModeTracker mocks and assertions on overall_status and impacts. New TestReadinessDegradedMode tests the degraded-mode path. ReadinessResponse unit tests pass overall_status on construction and assert 2 OpenAPI examples.
OpenAPI spec
docs/openapi.json
Adds HealthStatus enum schema, expands ReadinessResponse with overall_status and impacts, updates required list and examples, and rewrites the readiness probe operation description to document degraded-mode behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant ReadinessEndpoint as GET /readiness
  participant DegradedModeTracker
  participant ProviderHealth as get_providers_health_statuses
  participant ModelCheck as check_default_model_available

  Client->>ReadinessEndpoint: GET /readiness
  ReadinessEndpoint->>DegradedModeTracker: is_degraded()
  alt degraded
    DegradedModeTracker-->>ReadinessEndpoint: True + reason
    ReadinessEndpoint-->>Client: 200 overall_status=DEGRADED, ready=True, impacts=[...]
  else healthy
    DegradedModeTracker-->>ReadinessEndpoint: False
    ReadinessEndpoint->>ProviderHealth: get_providers_health_statuses()
    ProviderHealth-->>ReadinessEndpoint: provider list
    alt unhealthy providers
      ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
    else providers OK
      ReadinessEndpoint->>ModelCheck: check_default_model_available()
      ModelCheck-->>ReadinessEndpoint: available or not
      alt model unavailable
        ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
      else model available
        ReadinessEndpoint-->>Client: 200 overall_status=HEALTHY, ready=True, impacts=null
      end
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • lightspeed-core/lightspeed-stack#1781: Implements the same degraded-mode startup wiring in src/app/main.py around check_llama_stack_version and APIConnectionError handling that this PR builds upon.

Suggested reviewers

  • tisnik
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: enhancing the /readiness endpoint with degraded mode reporting capability, which aligns with the comprehensive changes across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
✨ Simplify code
  • Create PR with simplified code

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@anik120 anik120 force-pushed the degrade-mode-report-health branch from 5dae3bf to d70548e Compare June 18, 2026 15:16
@anik120

anik120 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Here's what the new response looks like:

$ curl http://localhost:8080/readiness | jq
{
  "ready": true,
  "reason": "All providers are healthy",
  "overall_status": "healthy",
  "impacts": null,
  "providers": []
}

When llama-stack is down:

$ curl http://localhost:8080/readiness | jq
{
  "ready": false,
  "reason": "Cannot connect to backend service",
  "overall_status": "unhealthy",
  "impacts": [
    "LLM inference unavailable",
    "Provider health checks unavailable"
  ],
  "providers": []
}

@anik120 anik120 force-pushed the degrade-mode-report-health branch from d70548e to 4d4a3a9 Compare June 18, 2026 15:21
lightspeed-core#1781 introduced
"degraded mode support" - the ability to start lightspeed-stack and keep it running
even when llama-stack server might not be available.

This PR adds comprehensive degraded mode status reporting to the /readiness endpoint
while maintaining clean API boundaries and Kubernetes probe semantics.

- Enhanced HealthStatus enum with DEGRADED and UNHEALTHY service-level statuses
while preserving provider-level statuses (OK, ERROR, NOT_IMPLEMENTED, UNKNOWN)

- Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode
following Kubernetes semantics; only returns 503 when truly unhealthy

- Refactored to avoid leaking implementation details in API responses:
  * Removed llama_stack field from ReadinessResponse
  * Removed Llama Stack version tracking from DegradedModeTracker
  * Focus on functional impacts rather than internal technology stack

This design keeps internal implementation details (Llama Stack) private while
exposing clear functional impacts to API consumers.

Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
@anik120 anik120 force-pushed the degrade-mode-report-health branch from 4d4a3a9 to 0a3e946 Compare June 18, 2026 16:51

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/openapi.json (1)

9964-17967: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Regenerate the OpenAPI schema to fix CI pipeline failure.

The CI pipeline is failing because docs/openapi.json is out of sync with the generated schema. This file should not be manually edited but regenerated from the source Pydantic models.

Run the following command to regenerate:

uv run scripts/generate_openapi_schema.py docs/openapi.json

This will ensure consistency between the Python model definitions and the OpenAPI specification.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 9964 - 17967, The OpenAPI JSON is stale and
no longer matches the generated schema, causing CI to fail. Regenerate
docs/openapi.json from the source models instead of editing it manually by
running scripts/generate_openapi_schema.py, and verify the updated output is
consistent with the Pydantic schemas that define endpoints like
readiness_probe_get_method_readiness_get,
liveness_probe_get_method_liveness_get, and
authorized_endpoint_handler_authorized_post.

Source: Pipeline failures

tests/unit/test_degraded_mode.py (1)

9-60: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Prevent test order dependence from shared singleton state.

These tests mutate a process-wide singleton without resetting it between tests, so they can become flaky when run with different ordering or alongside other tests that touch DegradedModeTracker.

Suggested patch
+import pytest
 from utils.degraded_mode import DegradedModeTracker
 
 
+@pytest.fixture(autouse=True)
+def reset_degraded_mode_tracker() -> None:
+    tracker = DegradedModeTracker()
+    tracker.set_healthy()
+    yield
+    tracker.set_healthy()
+
+
 class TestDegradedModeTracker:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/test_degraded_mode.py` around lines 9 - 60, The tests are mutating
the shared DegradedModeTracker singleton state without resetting it between
tests, which causes test order dependence and flakiness. Add a pytest fixture
with autouse=True that resets the DegradedModeTracker singleton to its initial
healthy state (either by calling set_healthy() or by clearing the internal
state) after each test completes. Alternatively, add a setUp or tearDown method
to the test class that resets the singleton state before or after each test
method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/openapi.json`:
- Around line 13489-13502: The HealthStatus enum in src/models/common/health.py
has inconsistent casing where the "Error" value is capitalized while all other
enum values like "ok", "not_implemented", "unknown", "healthy", "degraded", and
"unhealthy" are lowercase. Either change the "Error" value to "error" to match
the lowercase convention used throughout the enum for consistency, or if this
capitalization is required for backwards compatibility with Llama Stack provider
responses, add explicit documentation to the HealthStatus enum docstring
explaining this dependency and why it differs from the other values.

In `@src/app/endpoints/health.py`:
- Line 151: Change the logger.info() call that logs "Response to /readiness
endpoint" to logger.debug() instead, since the readiness probe is called
frequently by orchestrators and logging at info level creates excessive noise in
the logs. Per coding guidelines, debug() should be used for diagnostic
information while info() should be reserved for significant program execution
events.
- Around line 159-163: The impact message strings ("LLM inference unavailable",
"RAG functionality unavailable", "Agent tools unavailable") are duplicated
across multiple locations in the health.py file (at lines 159-163, 186-189, and
214-214). Extract these string literals to shared constants in constants.py
(following the coding guideline to check constants.py before defining new ones),
then replace all hardcoded occurrences in the health endpoints with references
to these constants to ensure consistency and prevent message drift between
degraded and unhealthy responses.

In `@src/app/main.py`:
- Around line 91-95: When llama_stack_version is None, the code currently only
sets degraded mode if allow_degraded_mode is True, but fails to handle the case
where allow_degraded_mode is False. Add an else clause after the
degraded_tracker.set_degraded call to fail the startup process (using sys.exit
or raising an exception) when allow_degraded_mode is False, ensuring that
startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.

In `@src/utils/degraded_mode.py`:
- Around line 12-53: The class DegradedModeTracker and its methods need to
follow Google Python docstring conventions more strictly. Add an Attributes
section to the class docstring that documents the _is_degraded and
_degraded_reason instance variables with their types and descriptions. For each
method (set_degraded, set_healthy, is_degraded, get_degraded_reason), ensure the
docstrings include complete Google-style sections: Parameters (if applicable),
Returns, and Raises (if applicable). The set_degraded method should include a
Parameters section, while is_degraded and get_degraded_reason already have
Returns sections but may need Raises if exceptions can be thrown.

In `@tests/unit/app/endpoints/test_health.py`:
- Around line 382-396: The test for readiness_probe_get_method in degraded mode
is checking response payload fields but not verifying the HTTP status code
contract. Replace the mocker.Mock() object with a real Response object, and add
an assertion to verify that the response status code is 200 to ensure the probe
correctly returns HTTP 200 even in degraded mode.

---

Outside diff comments:
In `@docs/openapi.json`:
- Around line 9964-17967: The OpenAPI JSON is stale and no longer matches the
generated schema, causing CI to fail. Regenerate docs/openapi.json from the
source models instead of editing it manually by running
scripts/generate_openapi_schema.py, and verify the updated output is consistent
with the Pydantic schemas that define endpoints like
readiness_probe_get_method_readiness_get,
liveness_probe_get_method_liveness_get, and
authorized_endpoint_handler_authorized_post.

In `@tests/unit/test_degraded_mode.py`:
- Around line 9-60: The tests are mutating the shared DegradedModeTracker
singleton state without resetting it between tests, which causes test order
dependence and flakiness. Add a pytest fixture with autouse=True that resets the
DegradedModeTracker singleton to its initial healthy state (either by calling
set_healthy() or by clearing the internal state) after each test completes.
Alternatively, add a setUp or tearDown method to the test class that resets the
singleton state before or after each test method.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b244f354-9638-4b87-ab69-29115287d11b

📥 Commits

Reviewing files that changed from the base of the PR and between 1f7d8ac and 0a3e946.

📒 Files selected for processing (10)
  • docs/openapi.json
  • src/app/endpoints/health.py
  • src/app/main.py
  • src/models/api/responses/successful/probes.py
  • src/models/common/__init__.py
  • src/models/common/health.py
  • src/utils/degraded_mode.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/models/responses/test_successful_responses.py
  • tests/unit/test_degraded_mode.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
  • GitHub Check: build-pr
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: integration_tests (3.13)
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/**/*.py: Use absolute imports for internal modules: from authentication import get_auth_dependency
Llama Stack imports: Use from llama_stack_client import AsyncLlamaStackClient
Check constants.py for shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Use logger = get_logger(__name__) from log.py for module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Use async def for I/O operations and external API calls
Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes: Configuration, Error/Exception, Resolver, Interface
Abstract classes must use ABC with @abstractmethod decorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes

Files:

  • src/utils/degraded_mode.py
  • src/models/common/__init__.py
  • src/app/main.py
  • src/models/api/responses/successful/probes.py
  • src/app/endpoints/health.py
  • src/models/common/health.py
src/**/__init__.py

📄 CodeRabbit inference engine (AGENTS.md)

Package __init__.py files must contain brief package descriptions

Files:

  • src/models/common/__init__.py
src/models/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Pydantic models must use @model_validator and @field_validator for validation and complete type annotations for all attributes, avoiding Any type

Files:

  • src/models/common/__init__.py
  • src/models/api/responses/successful/probes.py
  • src/models/common/health.py
src/app/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

src/app/**/*.py: FastAPI dependencies: Import from fastapi module for APIRouter, HTTPException, Request, status, Depends
Use FastAPI HTTPException with appropriate status codes for API endpoints and handle APIConnectionError from Llama Stack

Files:

  • src/app/main.py
  • src/app/endpoints/health.py
tests/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Use pytest.mark.asyncio marker for async tests

Files:

  • tests/unit/test_degraded_mode.py
  • tests/unit/app/endpoints/test_health.py
  • tests/unit/models/responses/test_successful_responses.py
🧠 Learnings (3)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.

Applied to files:

  • src/models/common/__init__.py
  • src/models/api/responses/successful/probes.py
  • src/models/common/health.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.

Applied to files:

  • src/models/common/__init__.py
  • src/models/api/responses/successful/probes.py
  • src/models/common/health.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.

Applied to files:

  • src/app/endpoints/health.py
🪛 GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt
docs/openapi.json

[error] 1-1: CI failed because docs/openapi.json is out of date. Diff against regenerated schema /tmp/openapi-generated.json returned non-zero. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json

🪛 GitHub Actions: OpenAPI (Spectral) / spectral
docs/openapi.json

[error] 1-1: OpenAPI schema mismatch: docs/openapi.json differs from generated schema (/tmp/openapi-generated.json).

🔇 Additional comments (6)
src/models/common/health.py (1)

10-35: LGTM!

Also applies to: 57-57

src/models/common/__init__.py (1)

10-13: LGTM!

src/models/api/responses/successful/probes.py (1)

3-11: LGTM!

Also applies to: 52-123

docs/openapi.json (1)

17948-17965: Example payloads align with implementation and test expectations.

The two schema examples (healthy and degraded states) correctly match:

  1. The healthy example omits impacts (implicitly null), has overall_status: "healthy", and empty providers
  2. The degraded example includes impacts array, overall_status: "degraded", and empty providers

This aligns with the endpoint implementation in src/app/endpoints/health.py and the test assertion expecting exactly 2 examples.

LGTM!

tests/unit/app/endpoints/test_health.py (1)

17-20: LGTM!

Also applies to: 31-35, 59-61, 71-75, 110-112, 123-127, 155-157

tests/unit/models/responses/test_successful_responses.py (1)

38-38: LGTM!

Also applies to: 416-425, 435-442, 461-462, 472-473

Comment thread docs/openapi.json
Comment on lines +13489 to +13502
"HealthStatus": {
"type": "string",
"enum": [
"ok",
"Error",
"not_implemented",
"unknown",
"healthy",
"degraded",
"unhealthy"
],
"title": "HealthStatus",
"description": "Health status enum for provider and service health checks.\n\nThis enum serves two purposes:\n\n1. Provider-level health (returned by Llama Stack providers):\n - OK: Provider is healthy and operational\n - ERROR: Provider is unhealthy or failed health check\n - NOT_IMPLEMENTED: Provider does not implement health checks\n - UNKNOWN: Fallback when provider status cannot be determined\n\n2. Service-level health (overall LCORE status):\n - HEALTHY: All systems operational, LLS connected, all providers healthy\n - DEGRADED: Service running with reduced functionality (e.g., LLS unavailable)\n - UNHEALTHY: Service connected but one or more providers are unhealthy"
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check how HealthStatus enum is defined in source code to verify casing is intentional

# Find the HealthStatus enum definition
ast-grep --pattern $'class HealthStatus($$$):
  $$$'

# Also search for any usage of HealthStatus.ERROR to understand the casing
rg -n "HealthStatus\." --type py -C2

Repository: lightspeed-core/lightspeed-stack

Length of output: 13185


Address the mixed casing in the HealthStatus enum values.

The HealthStatus enum in src/models/common/health.py defines ERROR = "Error" with capitalization while all other values ("ok", "not_implemented", "unknown", "healthy", "degraded", "unhealthy") are lowercase.

If this capitalization is required for backwards compatibility with Llama Stack provider responses, add explicit documentation to the enum docstring explaining the dependency. Otherwise, normalize all values to lowercase for consistency across the API contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 13489 - 13502, The HealthStatus enum in
src/models/common/health.py has inconsistent casing where the "Error" value is
capitalized while all other enum values like "ok", "not_implemented", "unknown",
"healthy", "degraded", and "unhealthy" are lowercase. Either change the "Error"
value to "error" to match the lowercase convention used throughout the enum for
consistency, or if this capitalization is required for backwards compatibility
with Llama Stack provider responses, add explicit documentation to the
HealthStatus enum docstring explaining this dependency and why it differs from
the other values.

_ = auth

logger.info("Response to /v1/readiness endpoint")
logger.info("Response to /readiness endpoint")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use debug() for the readiness probe trace log.

/readiness is hit frequently by orchestrators, so emitting this at info will create noisy logs and reduce operational signal quality.

As per coding guidelines, "Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/health.py` at line 151, Change the logger.info() call that
logs "Response to /readiness endpoint" to logger.debug() instead, since the
readiness probe is called frequently by orchestrators and logging at info level
creates excessive noise in the logs. Per coding guidelines, debug() should be
used for diagnostic information while info() should be reserved for significant
program execution events.

Source: Coding guidelines

Comment on lines +159 to +163
impacts = [
"LLM inference unavailable",
"RAG functionality unavailable",
"Agent tools unavailable",
]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Extract readiness impact strings to shared constants.

These impact messages are duplicated across branches. Centralizing them avoids message drift between degraded/unhealthy responses and keeps API text consistent.

As per coding guidelines, "Check constants.py for shared constants before defining new ones."

Also applies to: 186-189, 214-214

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/health.py` around lines 159 - 163, The impact message
strings ("LLM inference unavailable", "RAG functionality unavailable", "Agent
tools unavailable") are duplicated across multiple locations in the health.py
file (at lines 159-163, 186-189, and 214-214). Extract these string literals to
shared constants in constants.py (following the coding guideline to check
constants.py before defining new ones), then replace all hardcoded occurrences
in the health endpoints with references to these constants to ensure consistency
and prevent message drift between degraded and unhealthy responses.

Source: Coding guidelines

Comment thread src/app/main.py
Comment on lines 91 to 95
if llama_stack_version is None:
logger.error("Cannot retrieve Llama Stack version, check connection")
if llama_stack_config.allow_degraded_mode:
degraded_tracker.set_degraded("Llama Stack connection check failed")
else:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail startup when connectivity check fails and degraded mode is disabled.

If version retrieval fails in this branch, startup currently continues unless degraded mode is enabled. That allows running in an unintended unhealthy state when degraded mode is explicitly off.

Suggested patch
         if llama_stack_version is None:
             logger.error("Cannot retrieve Llama Stack version, check connection")
             if llama_stack_config.allow_degraded_mode:
                 degraded_tracker.set_degraded("Llama Stack connection check failed")
+            else:
+                raise RuntimeError(
+                    "Cannot retrieve Llama Stack version while degraded mode is disabled"
+                )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/main.py` around lines 91 - 95, When llama_stack_version is None, the
code currently only sets degraded mode if allow_degraded_mode is True, but fails
to handle the case where allow_degraded_mode is False. Add an else clause after
the degraded_tracker.set_degraded call to fail the startup process (using
sys.exit or raising an exception) when allow_degraded_mode is False, ensuring
that startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.

Comment on lines +12 to +53
class DegradedModeTracker(metaclass=Singleton):
"""Track degraded mode state for Lightspeed Core Stack.

When LCORE cannot connect to Llama Stack during startup and
allow_degraded_mode is enabled, the service enters degraded mode.
This tracker maintains that state for health reporting.
"""

def __init__(self) -> None:
"""Initialize the degraded mode tracker."""
self._is_degraded: bool = False
self._degraded_reason: Optional[str] = None

def set_degraded(self, reason: str) -> None:
"""Mark the service as running in degraded mode.

Parameters:
reason: Description of why degraded mode was entered.
"""
self._is_degraded = True
self._degraded_reason = reason

def set_healthy(self) -> None:
"""Mark the service as running in healthy mode."""
self._is_degraded = False
self._degraded_reason = None

def is_degraded(self) -> bool:
"""Check if the service is running in degraded mode.

Returns:
True if service is in degraded mode, False otherwise.
"""
return self._is_degraded

def get_degraded_reason(self) -> Optional[str]:
"""Get the reason for degraded mode.

Returns:
Description of why degraded mode was entered, or None if healthy.
"""
return self._degraded_reason

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Align class and method docstrings with required Google-style sections.

This file has docstrings, but they don’t follow the required sectioned format consistently (including Attributes for the class and complete function sections).

As per coding guidelines, “All functions must have complete type annotations … and include descriptive docstrings” and “Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes.”

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/utils/degraded_mode.py` around lines 12 - 53, The class
DegradedModeTracker and its methods need to follow Google Python docstring
conventions more strictly. Add an Attributes section to the class docstring that
documents the _is_degraded and _degraded_reason instance variables with their
types and descriptions. For each method (set_degraded, set_healthy, is_degraded,
get_degraded_reason), ensure the docstrings include complete Google-style
sections: Parameters (if applicable), Returns, and Raises (if applicable). The
set_degraded method should include a Parameters section, while is_degraded and
get_degraded_reason already have Returns sections but may need Raises if
exceptions can be thrown.

Source: Coding guidelines

Comment on lines +382 to +396
mock_response = mocker.Mock()
auth: AuthTuple = ("test_user_id", "test_user", True, "test_token")

response = await readiness_probe_get_method(auth=auth, response=mock_response)

assert response is not None
assert isinstance(response, ReadinessResponse)
assert response.ready is True # Service is ready even in degraded mode
assert response.reason == "Service running in degraded mode"
assert response.overall_status == HealthStatus.DEGRADED
assert response.impacts is not None
assert "LLM inference unavailable" in response.impacts
assert "RAG functionality unavailable" in response.impacts
assert "Agent tools unavailable" in response.impacts
assert len(response.providers) == 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert degraded mode returns HTTP 200 in this test.

The test checks payload fields but misses the key probe contract for degraded mode. Use a real Response object and assert status code remains 200.

Suggested test hardening
+from fastapi import Response
...
-        mock_response = mocker.Mock()
+        mock_response = Response()
...
         response = await readiness_probe_get_method(auth=auth, response=mock_response)
...
+        assert mock_response.status_code == 200
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unit/app/endpoints/test_health.py` around lines 382 - 396, The test for
readiness_probe_get_method in degraded mode is checking response payload fields
but not verifying the HTTP status code contract. Replace the mocker.Mock()
object with a real Response object, and add an assertion to verify that the
response status code is 200 to ensure the probe correctly returns HTTP 200 even
in degraded mode.

@anik120

anik120 commented Jun 18, 2026

Copy link
Copy Markdown
Contributor Author

Looks like e2e failed here because "disk ran out of space while building lls image" :')

https://redhat-internal.slack.com/archives/C08SG3FTQQ2/p1781809579008129 could help here too.

For now we'll have to re-trigger the tests 🤞🏽

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant