LCORE-1859: Enhance /readiness endpoint with degraded mode reporting#1947
LCORE-1859: Enhance /readiness endpoint with degraded mode reporting#1947anik120 wants to merge 1 commit into
Conversation
9bd1ce3 to
5dae3bf
Compare
WalkthroughIntroduces a ChangesDegraded Mode Tracking and Readiness Enrichment
Sequence Diagram(s)sequenceDiagram
participant Client
participant ReadinessEndpoint as GET /readiness
participant DegradedModeTracker
participant ProviderHealth as get_providers_health_statuses
participant ModelCheck as check_default_model_available
Client->>ReadinessEndpoint: GET /readiness
ReadinessEndpoint->>DegradedModeTracker: is_degraded()
alt degraded
DegradedModeTracker-->>ReadinessEndpoint: True + reason
ReadinessEndpoint-->>Client: 200 overall_status=DEGRADED, ready=True, impacts=[...]
else healthy
DegradedModeTracker-->>ReadinessEndpoint: False
ReadinessEndpoint->>ProviderHealth: get_providers_health_statuses()
ProviderHealth-->>ReadinessEndpoint: provider list
alt unhealthy providers
ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
else providers OK
ReadinessEndpoint->>ModelCheck: check_default_model_available()
ModelCheck-->>ReadinessEndpoint: available or not
alt model unavailable
ReadinessEndpoint-->>Client: 503 overall_status=UNHEALTHY, ready=False, impacts=[...]
else model available
ReadinessEndpoint-->>Client: 200 overall_status=HEALTHY, ready=True, impacts=null
end
end
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
5dae3bf to
d70548e
Compare
|
Here's what the new response looks like: When llama-stack is down: |
d70548e to
4d4a3a9
Compare
lightspeed-core#1781 introduced "degraded mode support" - the ability to start lightspeed-stack and keep it running even when llama-stack server might not be available. This PR adds comprehensive degraded mode status reporting to the /readiness endpoint while maintaining clean API boundaries and Kubernetes probe semantics. - Enhanced HealthStatus enum with DEGRADED and UNHEALTHY service-level statuses while preserving provider-level statuses (OK, ERROR, NOT_IMPLEMENTED, UNKNOWN) - Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode following Kubernetes semantics; only returns 503 when truly unhealthy - Refactored to avoid leaking implementation details in API responses: * Removed llama_stack field from ReadinessResponse * Removed Llama Stack version tracking from DegradedModeTracker * Focus on functional impacts rather than internal technology stack This design keeps internal implementation details (Llama Stack) private while exposing clear functional impacts to API consumers. Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
4d4a3a9 to
0a3e946
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
docs/openapi.json (1)
9964-17967:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winRegenerate the OpenAPI schema to fix CI pipeline failure.
The CI pipeline is failing because
docs/openapi.jsonis out of sync with the generated schema. This file should not be manually edited but regenerated from the source Pydantic models.Run the following command to regenerate:
uv run scripts/generate_openapi_schema.py docs/openapi.jsonThis will ensure consistency between the Python model definitions and the OpenAPI specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/openapi.json` around lines 9964 - 17967, The OpenAPI JSON is stale and no longer matches the generated schema, causing CI to fail. Regenerate docs/openapi.json from the source models instead of editing it manually by running scripts/generate_openapi_schema.py, and verify the updated output is consistent with the Pydantic schemas that define endpoints like readiness_probe_get_method_readiness_get, liveness_probe_get_method_liveness_get, and authorized_endpoint_handler_authorized_post.Source: Pipeline failures
tests/unit/test_degraded_mode.py (1)
9-60:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPrevent test order dependence from shared singleton state.
These tests mutate a process-wide singleton without resetting it between tests, so they can become flaky when run with different ordering or alongside other tests that touch
DegradedModeTracker.Suggested patch
+import pytest from utils.degraded_mode import DegradedModeTracker +@pytest.fixture(autouse=True) +def reset_degraded_mode_tracker() -> None: + tracker = DegradedModeTracker() + tracker.set_healthy() + yield + tracker.set_healthy() + + class TestDegradedModeTracker:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/test_degraded_mode.py` around lines 9 - 60, The tests are mutating the shared DegradedModeTracker singleton state without resetting it between tests, which causes test order dependence and flakiness. Add a pytest fixture with autouse=True that resets the DegradedModeTracker singleton to its initial healthy state (either by calling set_healthy() or by clearing the internal state) after each test completes. Alternatively, add a setUp or tearDown method to the test class that resets the singleton state before or after each test method.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/openapi.json`:
- Around line 13489-13502: The HealthStatus enum in src/models/common/health.py
has inconsistent casing where the "Error" value is capitalized while all other
enum values like "ok", "not_implemented", "unknown", "healthy", "degraded", and
"unhealthy" are lowercase. Either change the "Error" value to "error" to match
the lowercase convention used throughout the enum for consistency, or if this
capitalization is required for backwards compatibility with Llama Stack provider
responses, add explicit documentation to the HealthStatus enum docstring
explaining this dependency and why it differs from the other values.
In `@src/app/endpoints/health.py`:
- Line 151: Change the logger.info() call that logs "Response to /readiness
endpoint" to logger.debug() instead, since the readiness probe is called
frequently by orchestrators and logging at info level creates excessive noise in
the logs. Per coding guidelines, debug() should be used for diagnostic
information while info() should be reserved for significant program execution
events.
- Around line 159-163: The impact message strings ("LLM inference unavailable",
"RAG functionality unavailable", "Agent tools unavailable") are duplicated
across multiple locations in the health.py file (at lines 159-163, 186-189, and
214-214). Extract these string literals to shared constants in constants.py
(following the coding guideline to check constants.py before defining new ones),
then replace all hardcoded occurrences in the health endpoints with references
to these constants to ensure consistency and prevent message drift between
degraded and unhealthy responses.
In `@src/app/main.py`:
- Around line 91-95: When llama_stack_version is None, the code currently only
sets degraded mode if allow_degraded_mode is True, but fails to handle the case
where allow_degraded_mode is False. Add an else clause after the
degraded_tracker.set_degraded call to fail the startup process (using sys.exit
or raising an exception) when allow_degraded_mode is False, ensuring that
startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.
In `@src/utils/degraded_mode.py`:
- Around line 12-53: The class DegradedModeTracker and its methods need to
follow Google Python docstring conventions more strictly. Add an Attributes
section to the class docstring that documents the _is_degraded and
_degraded_reason instance variables with their types and descriptions. For each
method (set_degraded, set_healthy, is_degraded, get_degraded_reason), ensure the
docstrings include complete Google-style sections: Parameters (if applicable),
Returns, and Raises (if applicable). The set_degraded method should include a
Parameters section, while is_degraded and get_degraded_reason already have
Returns sections but may need Raises if exceptions can be thrown.
In `@tests/unit/app/endpoints/test_health.py`:
- Around line 382-396: The test for readiness_probe_get_method in degraded mode
is checking response payload fields but not verifying the HTTP status code
contract. Replace the mocker.Mock() object with a real Response object, and add
an assertion to verify that the response status code is 200 to ensure the probe
correctly returns HTTP 200 even in degraded mode.
---
Outside diff comments:
In `@docs/openapi.json`:
- Around line 9964-17967: The OpenAPI JSON is stale and no longer matches the
generated schema, causing CI to fail. Regenerate docs/openapi.json from the
source models instead of editing it manually by running
scripts/generate_openapi_schema.py, and verify the updated output is consistent
with the Pydantic schemas that define endpoints like
readiness_probe_get_method_readiness_get,
liveness_probe_get_method_liveness_get, and
authorized_endpoint_handler_authorized_post.
In `@tests/unit/test_degraded_mode.py`:
- Around line 9-60: The tests are mutating the shared DegradedModeTracker
singleton state without resetting it between tests, which causes test order
dependence and flakiness. Add a pytest fixture with autouse=True that resets the
DegradedModeTracker singleton to its initial healthy state (either by calling
set_healthy() or by clearing the internal state) after each test completes.
Alternatively, add a setUp or tearDown method to the test class that resets the
singleton state before or after each test method.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: b244f354-9638-4b87-ab69-29115287d11b
📒 Files selected for processing (10)
docs/openapi.jsonsrc/app/endpoints/health.pysrc/app/main.pysrc/models/api/responses/successful/probes.pysrc/models/common/__init__.pysrc/models/common/health.pysrc/utils/degraded_mode.pytests/unit/app/endpoints/test_health.pytests/unit/models/responses/test_successful_responses.pytests/unit/test_degraded_mode.py
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (12)
- GitHub Check: build-pr
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
🧰 Additional context used
📓 Path-based instructions (5)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/utils/degraded_mode.pysrc/models/common/__init__.pysrc/app/main.pysrc/models/api/responses/successful/probes.pysrc/app/endpoints/health.pysrc/models/common/health.py
src/**/__init__.py
📄 CodeRabbit inference engine (AGENTS.md)
Package
__init__.pyfiles must contain brief package descriptions
Files:
src/models/common/__init__.py
src/models/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Pydantic models must use
@model_validatorand@field_validatorfor validation and complete type annotations for all attributes, avoidingAnytype
Files:
src/models/common/__init__.pysrc/models/api/responses/successful/probes.pysrc/models/common/health.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/main.pysrc/app/endpoints/health.py
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/test_degraded_mode.pytests/unit/app/endpoints/test_health.pytests/unit/models/responses/test_successful_responses.py
🧠 Learnings (3)
📚 Learning: 2026-01-12T10:58:40.230Z
Learnt from: blublinsky
Repo: lightspeed-core/lightspeed-stack PR: 972
File: src/models/config.py:459-513
Timestamp: 2026-01-12T10:58:40.230Z
Learning: In lightspeed-core/lightspeed-stack, for Python files under src/models, when a user claims a fix is done but the issue persists, verify the current code state before accepting the fix. Steps: review the diff, fetch the latest changes, run relevant tests, reproduce the issue, search the codebase for lingering references to the original problem, confirm the fix is applied and not undone by subsequent commits, and validate with local checks to ensure the issue is resolved.
Applied to files:
src/models/common/__init__.pysrc/models/api/responses/successful/probes.pysrc/models/common/health.py
📚 Learning: 2026-02-25T07:46:33.545Z
Learnt from: asimurka
Repo: lightspeed-core/lightspeed-stack PR: 1211
File: src/models/responses.py:8-16
Timestamp: 2026-02-25T07:46:33.545Z
Learning: In the Python codebase, requests.py should use OpenAIResponseInputTool as Tool while responses.py uses OpenAIResponseTool as Tool. This difference is intentional due to differing schemas for input vs output tools in llama-stack-api. Apply this distinction consistently to other models under src/models (e.g., ensure request-related tools use the InputTool variant and response-related tools use the ResponseTool variant). If adding new tools, choose the corresponding InputTool or Tool class based on whether the tool represents input or output, and document the rationale in code comments.
Applied to files:
src/models/common/__init__.pysrc/models/api/responses/successful/probes.pysrc/models/common/health.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/health.py
🪛 GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt
docs/openapi.json
[error] 1-1: CI failed because docs/openapi.json is out of date. Diff against regenerated schema /tmp/openapi-generated.json returned non-zero. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json
🪛 GitHub Actions: OpenAPI (Spectral) / spectral
docs/openapi.json
[error] 1-1: OpenAPI schema mismatch: docs/openapi.json differs from generated schema (/tmp/openapi-generated.json).
🔇 Additional comments (6)
src/models/common/health.py (1)
10-35: LGTM!Also applies to: 57-57
src/models/common/__init__.py (1)
10-13: LGTM!src/models/api/responses/successful/probes.py (1)
3-11: LGTM!Also applies to: 52-123
docs/openapi.json (1)
17948-17965: Example payloads align with implementation and test expectations.The two schema examples (healthy and degraded states) correctly match:
- The healthy example omits
impacts(implicitly null), hasoverall_status: "healthy", and emptyproviders- The degraded example includes
impactsarray,overall_status: "degraded", and emptyprovidersThis aligns with the endpoint implementation in
src/app/endpoints/health.pyand the test assertion expecting exactly 2 examples.LGTM!
tests/unit/app/endpoints/test_health.py (1)
17-20: LGTM!Also applies to: 31-35, 59-61, 71-75, 110-112, 123-127, 155-157
tests/unit/models/responses/test_successful_responses.py (1)
38-38: LGTM!Also applies to: 416-425, 435-442, 461-462, 472-473
| "HealthStatus": { | ||
| "type": "string", | ||
| "enum": [ | ||
| "ok", | ||
| "Error", | ||
| "not_implemented", | ||
| "unknown", | ||
| "healthy", | ||
| "degraded", | ||
| "unhealthy" | ||
| ], | ||
| "title": "HealthStatus", | ||
| "description": "Health status enum for provider and service health checks.\n\nThis enum serves two purposes:\n\n1. Provider-level health (returned by Llama Stack providers):\n - OK: Provider is healthy and operational\n - ERROR: Provider is unhealthy or failed health check\n - NOT_IMPLEMENTED: Provider does not implement health checks\n - UNKNOWN: Fallback when provider status cannot be determined\n\n2. Service-level health (overall LCORE status):\n - HEALTHY: All systems operational, LLS connected, all providers healthy\n - DEGRADED: Service running with reduced functionality (e.g., LLS unavailable)\n - UNHEALTHY: Service connected but one or more providers are unhealthy" | ||
| }, |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check how HealthStatus enum is defined in source code to verify casing is intentional
# Find the HealthStatus enum definition
ast-grep --pattern $'class HealthStatus($$$):
$$$'
# Also search for any usage of HealthStatus.ERROR to understand the casing
rg -n "HealthStatus\." --type py -C2Repository: lightspeed-core/lightspeed-stack
Length of output: 13185
Address the mixed casing in the HealthStatus enum values.
The HealthStatus enum in src/models/common/health.py defines ERROR = "Error" with capitalization while all other values ("ok", "not_implemented", "unknown", "healthy", "degraded", "unhealthy") are lowercase.
If this capitalization is required for backwards compatibility with Llama Stack provider responses, add explicit documentation to the enum docstring explaining the dependency. Otherwise, normalize all values to lowercase for consistency across the API contract.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/openapi.json` around lines 13489 - 13502, The HealthStatus enum in
src/models/common/health.py has inconsistent casing where the "Error" value is
capitalized while all other enum values like "ok", "not_implemented", "unknown",
"healthy", "degraded", and "unhealthy" are lowercase. Either change the "Error"
value to "error" to match the lowercase convention used throughout the enum for
consistency, or if this capitalization is required for backwards compatibility
with Llama Stack provider responses, add explicit documentation to the
HealthStatus enum docstring explaining this dependency and why it differs from
the other values.
| _ = auth | ||
|
|
||
| logger.info("Response to /v1/readiness endpoint") | ||
| logger.info("Response to /readiness endpoint") |
There was a problem hiding this comment.
Use debug() for the readiness probe trace log.
/readiness is hit frequently by orchestrators, so emitting this at info will create noisy logs and reduce operational signal quality.
As per coding guidelines, "Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/app/endpoints/health.py` at line 151, Change the logger.info() call that
logs "Response to /readiness endpoint" to logger.debug() instead, since the
readiness probe is called frequently by orchestrators and logging at info level
creates excessive noise in the logs. Per coding guidelines, debug() should be
used for diagnostic information while info() should be reserved for significant
program execution events.
Source: Coding guidelines
| impacts = [ | ||
| "LLM inference unavailable", | ||
| "RAG functionality unavailable", | ||
| "Agent tools unavailable", | ||
| ] |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Extract readiness impact strings to shared constants.
These impact messages are duplicated across branches. Centralizing them avoids message drift between degraded/unhealthy responses and keeps API text consistent.
As per coding guidelines, "Check constants.py for shared constants before defining new ones."
Also applies to: 186-189, 214-214
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/app/endpoints/health.py` around lines 159 - 163, The impact message
strings ("LLM inference unavailable", "RAG functionality unavailable", "Agent
tools unavailable") are duplicated across multiple locations in the health.py
file (at lines 159-163, 186-189, and 214-214). Extract these string literals to
shared constants in constants.py (following the coding guideline to check
constants.py before defining new ones), then replace all hardcoded occurrences
in the health endpoints with references to these constants to ensure consistency
and prevent message drift between degraded and unhealthy responses.
Source: Coding guidelines
| if llama_stack_version is None: | ||
| logger.error("Cannot retrieve Llama Stack version, check connection") | ||
| if llama_stack_config.allow_degraded_mode: | ||
| degraded_tracker.set_degraded("Llama Stack connection check failed") | ||
| else: |
There was a problem hiding this comment.
Fail startup when connectivity check fails and degraded mode is disabled.
If version retrieval fails in this branch, startup currently continues unless degraded mode is enabled. That allows running in an unintended unhealthy state when degraded mode is explicitly off.
Suggested patch
if llama_stack_version is None:
logger.error("Cannot retrieve Llama Stack version, check connection")
if llama_stack_config.allow_degraded_mode:
degraded_tracker.set_degraded("Llama Stack connection check failed")
+ else:
+ raise RuntimeError(
+ "Cannot retrieve Llama Stack version while degraded mode is disabled"
+ )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/app/main.py` around lines 91 - 95, When llama_stack_version is None, the
code currently only sets degraded mode if allow_degraded_mode is True, but fails
to handle the case where allow_degraded_mode is False. Add an else clause after
the degraded_tracker.set_degraded call to fail the startup process (using
sys.exit or raising an exception) when allow_degraded_mode is False, ensuring
that startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.
| class DegradedModeTracker(metaclass=Singleton): | ||
| """Track degraded mode state for Lightspeed Core Stack. | ||
|
|
||
| When LCORE cannot connect to Llama Stack during startup and | ||
| allow_degraded_mode is enabled, the service enters degraded mode. | ||
| This tracker maintains that state for health reporting. | ||
| """ | ||
|
|
||
| def __init__(self) -> None: | ||
| """Initialize the degraded mode tracker.""" | ||
| self._is_degraded: bool = False | ||
| self._degraded_reason: Optional[str] = None | ||
|
|
||
| def set_degraded(self, reason: str) -> None: | ||
| """Mark the service as running in degraded mode. | ||
|
|
||
| Parameters: | ||
| reason: Description of why degraded mode was entered. | ||
| """ | ||
| self._is_degraded = True | ||
| self._degraded_reason = reason | ||
|
|
||
| def set_healthy(self) -> None: | ||
| """Mark the service as running in healthy mode.""" | ||
| self._is_degraded = False | ||
| self._degraded_reason = None | ||
|
|
||
| def is_degraded(self) -> bool: | ||
| """Check if the service is running in degraded mode. | ||
|
|
||
| Returns: | ||
| True if service is in degraded mode, False otherwise. | ||
| """ | ||
| return self._is_degraded | ||
|
|
||
| def get_degraded_reason(self) -> Optional[str]: | ||
| """Get the reason for degraded mode. | ||
|
|
||
| Returns: | ||
| Description of why degraded mode was entered, or None if healthy. | ||
| """ | ||
| return self._degraded_reason |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Align class and method docstrings with required Google-style sections.
This file has docstrings, but they don’t follow the required sectioned format consistently (including Attributes for the class and complete function sections).
As per coding guidelines, “All functions must have complete type annotations … and include descriptive docstrings” and “Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/degraded_mode.py` around lines 12 - 53, The class
DegradedModeTracker and its methods need to follow Google Python docstring
conventions more strictly. Add an Attributes section to the class docstring that
documents the _is_degraded and _degraded_reason instance variables with their
types and descriptions. For each method (set_degraded, set_healthy, is_degraded,
get_degraded_reason), ensure the docstrings include complete Google-style
sections: Parameters (if applicable), Returns, and Raises (if applicable). The
set_degraded method should include a Parameters section, while is_degraded and
get_degraded_reason already have Returns sections but may need Raises if
exceptions can be thrown.
Source: Coding guidelines
| mock_response = mocker.Mock() | ||
| auth: AuthTuple = ("test_user_id", "test_user", True, "test_token") | ||
|
|
||
| response = await readiness_probe_get_method(auth=auth, response=mock_response) | ||
|
|
||
| assert response is not None | ||
| assert isinstance(response, ReadinessResponse) | ||
| assert response.ready is True # Service is ready even in degraded mode | ||
| assert response.reason == "Service running in degraded mode" | ||
| assert response.overall_status == HealthStatus.DEGRADED | ||
| assert response.impacts is not None | ||
| assert "LLM inference unavailable" in response.impacts | ||
| assert "RAG functionality unavailable" in response.impacts | ||
| assert "Agent tools unavailable" in response.impacts | ||
| assert len(response.providers) == 0 |
There was a problem hiding this comment.
Assert degraded mode returns HTTP 200 in this test.
The test checks payload fields but misses the key probe contract for degraded mode. Use a real Response object and assert status code remains 200.
Suggested test hardening
+from fastapi import Response
...
- mock_response = mocker.Mock()
+ mock_response = Response()
...
response = await readiness_probe_get_method(auth=auth, response=mock_response)
...
+ assert mock_response.status_code == 200🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/unit/app/endpoints/test_health.py` around lines 382 - 396, The test for
readiness_probe_get_method in degraded mode is checking response payload fields
but not verifying the HTTP status code contract. Replace the mocker.Mock()
object with a real Response object, and add an assertion to verify that the
response status code is 200 to ensure the probe correctly returns HTTP 200 even
in degraded mode.
|
Looks like e2e failed here because "disk ran out of space while building lls image" :') https://redhat-internal.slack.com/archives/C08SG3FTQQ2/p1781809579008129 could help here too. For now we'll have to re-trigger the tests 🤞🏽 |
Description
#1781 introduced "degraded mode support" - the ability to start lightspeed-stack and keep it running even when llama-stack server might not be available.
This PR adds comprehensive degraded mode status reporting to the /readiness endpoint while maintaining clean API boundaries and Kubernetes probe semantics.
Enhanced HealthStatus enum with
DEGRADEDandUNHEALTHYservice-level statuses while preserving provider-level statuses (OK,ERROR,NOT_IMPLEMENTED,UNKNOWN)Enhanced /readiness endpoint to return 200 (ready=true) in degraded mode following Kubernetes semantics; only returns 503 when truly unhealthy
Refactored to avoid leaking implementation details in API responses:
This design keeps internal implementation details (Llama Stack) private while exposing clear functional impacts to API consumers.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Documentation