-
Notifications
You must be signed in to change notification settings - Fork 94
LCORE-1859: Enhance /readiness endpoint with degraded mode reporting #1947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,8 +26,12 @@ | |
| LivenessResponse, | ||
| ReadinessResponse, | ||
| ) | ||
| from models.common import HealthStatus, ProviderHealthStatus | ||
| from models.common import ( | ||
| HealthStatus, | ||
| ProviderHealthStatus, | ||
| ) | ||
| from models.config import Action | ||
| from utils.degraded_mode import DegradedModeTracker | ||
|
|
||
| logger = get_logger(__name__) | ||
| router = APIRouter(tags=["health"]) | ||
|
|
@@ -117,11 +121,11 @@ async def readiness_probe_get_method( | |
| response: Response, | ||
| ) -> ReadinessResponse: | ||
| """ | ||
| Handle the readiness probe endpoint, returning service readiness. | ||
| Handle the readiness probe endpoint, returning service readiness and health status. | ||
|
|
||
| If any provider reports an error status, responds with HTTP 503 | ||
| and details of unhealthy providers; otherwise, indicates the | ||
| service is ready. | ||
| Returns comprehensive health information including overall service status, | ||
| provider health, and functional impacts. The service is considered "ready" even | ||
| in degraded mode (returns 200), but reports reduced functionality. | ||
|
|
||
| ### Parameters: | ||
| - response: The outgoing HTTP response (used by middleware). | ||
|
|
@@ -130,47 +134,94 @@ async def readiness_probe_get_method( | |
| ### Raises: | ||
| - HTTPException: with status 401 for unauthorized access. | ||
| - HTTPException: with status 403 if permission is denied. | ||
| - HTTPException: with status 500 and a detail object containing `response` | ||
| and `cause` when service configuration is wrong or incomplete. | ||
| - HTTPException: with status 503 and a detail object containing `response` | ||
| and `cause` when unable to connect to Llama Stack. | ||
| - HTTPException: with status 503 when service is unhealthy (providers down, | ||
| models unavailable) and degraded mode is not enabled. | ||
|
|
||
| ### Returns: | ||
| - ReadinessResponse: Object with `ready` indicating overall readiness, | ||
| `reason` explaining the outcome, and `providers` containing the list of | ||
| unhealthy ProviderHealthStatus entries (empty when ready). | ||
| - ReadinessResponse: Object with comprehensive health status including: | ||
| - ready: True if service can handle requests (even in degraded mode) | ||
| - reason: Description of service state | ||
| - overall_status: healthy, degraded, or unhealthy | ||
| - impacts: Functional limitations when degraded/unhealthy | ||
| - providers: List of unhealthy providers | ||
| """ | ||
| # Used only for authorization | ||
| _ = auth | ||
|
|
||
| logger.info("Response to /v1/readiness endpoint") | ||
| logger.info("Response to /readiness endpoint") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Use
As per coding guidelines, "Use standard log levels with clear purposes: 🤖 Prompt for AI AgentsSource: Coding guidelines |
||
|
|
||
| provider_statuses = await get_providers_health_statuses() | ||
| degraded_tracker = DegradedModeTracker() | ||
| is_degraded = degraded_tracker.is_degraded() | ||
|
|
||
| # Determine overall status | ||
| if is_degraded: | ||
| # Service is ready (can serve health checks, metrics, etc.) but degraded | ||
| impacts = [ | ||
| "LLM inference unavailable", | ||
| "RAG functionality unavailable", | ||
| "Agent tools unavailable", | ||
| ] | ||
|
Comment on lines
+159
to
+163
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win Extract readiness impact strings to shared constants. These impact messages are duplicated across branches. Centralizing them avoids message drift between degraded/unhealthy responses and keeps API text consistent. As per coding guidelines, "Check Also applies to: 186-189, 214-214 🤖 Prompt for AI AgentsSource: Coding guidelines |
||
| return ReadinessResponse( | ||
| ready=True, | ||
| reason="Service running in degraded mode", | ||
| overall_status=HealthStatus.DEGRADED, | ||
| impacts=impacts, | ||
| providers=[], | ||
| ) | ||
|
|
||
| # Check if any provider is unhealthy (not counting not_implemented as unhealthy) | ||
| # Not in degraded mode - check provider health | ||
| provider_statuses = await get_providers_health_statuses() | ||
| unhealthy_providers = [ | ||
| p for p in provider_statuses if p.status == HealthStatus.ERROR.value | ||
| ] | ||
|
|
||
| if unhealthy_providers: | ||
| ready = False | ||
| unhealthy_provider_names = [p.provider_id for p in unhealthy_providers] | ||
| reason = f"Providers not healthy: {', '.join(unhealthy_provider_names)}" | ||
| # Check if this is a connection error (provider_id="unknown") | ||
| is_connection_error = any( | ||
| p.provider_id == "unknown" for p in unhealthy_providers | ||
| ) | ||
|
|
||
| if is_connection_error: | ||
| reason = "Cannot connect to backend service" | ||
| impacts = [ | ||
| "LLM inference unavailable", | ||
| "Provider health checks unavailable", | ||
| ] | ||
| else: | ||
| unhealthy_provider_names = [p.provider_id for p in unhealthy_providers] | ||
| reason = f"Providers not healthy: {', '.join(unhealthy_provider_names)}" | ||
| impacts = [ | ||
| f"Provider {p.provider_id}: {p.message}" for p in unhealthy_providers | ||
| ] | ||
|
|
||
| response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE | ||
| return ReadinessResponse( | ||
| ready=ready, reason=reason, providers=unhealthy_providers | ||
| ready=False, | ||
| reason=reason, | ||
| overall_status=HealthStatus.UNHEALTHY, | ||
| impacts=impacts, | ||
| providers=unhealthy_providers if not is_connection_error else [], | ||
| ) | ||
|
|
||
| # Check that the default model is registered in the model registry | ||
| model_available, model_reason = await check_default_model_available() | ||
| if not model_available: | ||
| response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE | ||
| return ReadinessResponse( | ||
| ready=False, reason=model_reason, providers=unhealthy_providers | ||
| ready=False, | ||
| reason=model_reason, | ||
| overall_status=HealthStatus.UNHEALTHY, | ||
| impacts=["Default model not available in registry"], | ||
| providers=[], | ||
| ) | ||
|
|
||
| # All healthy | ||
| return ReadinessResponse( | ||
| ready=True, reason="All providers are healthy", providers=unhealthy_providers | ||
| ready=True, | ||
| reason="All providers are healthy", | ||
| overall_status=HealthStatus.HEALTHY, | ||
| impacts=None, | ||
| providers=[], | ||
| ) | ||
|
|
||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -26,6 +26,7 @@ | |
| from models.api.responses.error import InternalServerErrorResponse | ||
| from sentry import initialize_sentry | ||
| from utils.common import register_mcp_servers_async | ||
| from utils.degraded_mode import DegradedModeTracker | ||
| from utils.llama_stack_version import check_llama_stack_version | ||
|
|
||
| logger = get_logger(__name__) | ||
|
|
@@ -81,15 +82,19 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]: | |
| await AsyncLlamaStackClientHolder().load(llama_stack_config) | ||
| client: AsyncLlamaStackClient = AsyncLlamaStackClientHolder().get_client() | ||
| logger.debug("Llama Stack client initialized, trying to connect to Llama Stack") | ||
| # check if the Llama Stack version is supported by the service | ||
| # Check connectivity to Llama Stack and set degraded mode if unavailable | ||
| degraded_tracker = DegradedModeTracker() | ||
| try: | ||
| llama_stack_version = await check_llama_stack_version( | ||
| client, llama_stack_config.max_retries, llama_stack_config.retry_delay | ||
| ) | ||
| if llama_stack_version is None: | ||
| logger.error("Cannot retrieve Llama Stack version, check connection") | ||
| if llama_stack_config.allow_degraded_mode: | ||
| degraded_tracker.set_degraded("Llama Stack connection check failed") | ||
| else: | ||
|
Comment on lines
91
to
95
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Fail startup when connectivity check fails and degraded mode is disabled. If version retrieval fails in this branch, startup currently continues unless degraded mode is enabled. That allows running in an unintended unhealthy state when degraded mode is explicitly off. Suggested patch if llama_stack_version is None:
logger.error("Cannot retrieve Llama Stack version, check connection")
if llama_stack_config.allow_degraded_mode:
degraded_tracker.set_degraded("Llama Stack connection check failed")
+ else:
+ raise RuntimeError(
+ "Cannot retrieve Llama Stack version while degraded mode is disabled"
+ )🤖 Prompt for AI Agents |
||
| logger.debug("Llama Stack version: %s", llama_stack_version) | ||
| degraded_tracker.set_healthy() | ||
| except APIConnectionError as e: | ||
| # if degraded mode is allowed, simply ignore the exception | ||
| llama_stack_url = llama_stack_config.url | ||
|
|
@@ -103,6 +108,7 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]: | |
| ) | ||
| if llama_stack_config.allow_degraded_mode: | ||
| logger.info("Entering degraded mode: LCORE running w/o Llama Stack") | ||
| degraded_tracker.set_degraded(f"Failed to connect to Llama Stack: {e!s}") | ||
| else: | ||
| raise | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
Repository: lightspeed-core/lightspeed-stack
Length of output: 13185
Address the mixed casing in the
HealthStatusenum values.The
HealthStatusenum insrc/models/common/health.pydefinesERROR = "Error"with capitalization while all other values ("ok","not_implemented","unknown","healthy","degraded","unhealthy") are lowercase.If this capitalization is required for backwards compatibility with Llama Stack provider responses, add explicit documentation to the enum docstring explaining the dependency. Otherwise, normalize all values to lowercase for consistency across the API contract.
🤖 Prompt for AI Agents