75 changes: 67 additions & 8 deletions docs/openapi.json
@@ -9964,7 +9964,7 @@
 "health"
 ],
 "summary": "Readiness Probe Get Method",
-"description": "Handle the readiness probe endpoint, returning service readiness.\n\nIf any provider reports an error status, responds with HTTP 503\nand details of unhealthy providers; otherwise, indicates the\nservice is ready.\n\n### Parameters:\n- response: The outgoing HTTP response (used by middleware).\n- auth: Authentication tuple from the auth dependency (used by middleware).\n\n### Raises:\n- HTTPException: with status 401 for unauthorized access.\n- HTTPException: with status 403 if permission is denied.\n- HTTPException: with status 500 and a detail object containing `response`\n and `cause` when service configuration is wrong or incomplete.\n- HTTPException: with status 503 and a detail object containing `response`\n and `cause` when unable to connect to Llama Stack.\n\n### Returns:\n- ReadinessResponse: Object with `ready` indicating overall readiness,\n `reason` explaining the outcome, and `providers` containing the list of\n unhealthy ProviderHealthStatus entries (empty when ready).",
+"description": "Handle the readiness probe endpoint, returning service readiness and health status.\n\nReturns comprehensive health information including overall service status,\nprovider health, and functional impacts. The service is considered \"ready\" even\nin degraded mode (returns 200), but reports reduced functionality.\n\n### Parameters:\n- response: The outgoing HTTP response (used by middleware).\n- auth: Authentication tuple from the auth dependency (used by middleware).\n\n### Raises:\n- HTTPException: with status 401 for unauthorized access.\n- HTTPException: with status 403 if permission is denied.\n- HTTPException: with status 503 when service is unhealthy (providers down,\n models unavailable) and degraded mode is not enabled.\n\n### Returns:\n- ReadinessResponse: Object with comprehensive health status including:\n - ready: True if service can handle requests (even in degraded mode)\n - reason: Description of service state\n - overall_status: healthy, degraded, or unhealthy\n - impacts: Functional limitations when degraded/unhealthy\n - providers: List of unhealthy providers",
 "operationId": "readiness_probe_get_method_readiness_get",
 "responses": {
 "200": {
@@ -9975,9 +9975,10 @@
 "$ref": "#/components/schemas/ReadinessResponse"
 },
 "example": {
+"overall_status": "healthy",
 "providers": [],
 "ready": true,
-"reason": "Service is ready"
+"reason": "All providers are healthy"
 }
 }
 }
@@ -13485,6 +13486,20 @@
 "type": "object",
 "title": "HTTPValidationError"
 },
+"HealthStatus": {
+"type": "string",
+"enum": [
+"ok",
+"Error",
+"not_implemented",
+"unknown",
+"healthy",
+"degraded",
+"unhealthy"
+],
+"title": "HealthStatus",
+"description": "Health status enum for provider and service health checks.\n\nThis enum serves two purposes:\n\n1. Provider-level health (returned by Llama Stack providers):\n - OK: Provider is healthy and operational\n - ERROR: Provider is unhealthy or failed health check\n - NOT_IMPLEMENTED: Provider does not implement health checks\n - UNKNOWN: Fallback when provider status cannot be determined\n\n2. Service-level health (overall LCORE status):\n - HEALTHY: All systems operational, LLS connected, all providers healthy\n - DEGRADED: Service running with reduced functionality (e.g., LLS unavailable)\n - UNHEALTHY: Service connected but one or more providers are unhealthy"
+},
Comment on lines +13489 to +13502

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check how HealthStatus enum is defined in source code to verify casing is intentional

# Find the HealthStatus enum definition
ast-grep --pattern $'class HealthStatus($$$):
  $$$'

# Also search for any usage of HealthStatus.ERROR to understand the casing
rg -n "HealthStatus\." --type py -C2

Repository: lightspeed-core/lightspeed-stack

Length of output: 13185


Address the mixed casing in the HealthStatus enum values.

The HealthStatus enum in src/models/common/health.py defines ERROR = "Error" with capitalization while all other values ("ok", "not_implemented", "unknown", "healthy", "degraded", "unhealthy") are lowercase.

If this capitalization is required for backwards compatibility with Llama Stack provider responses, add explicit documentation to the enum docstring explaining the dependency. Otherwise, normalize all values to lowercase for consistency across the API contract.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/openapi.json` around lines 13489 - 13502, The HealthStatus enum in
src/models/common/health.py has inconsistent casing where the "Error" value is
capitalized while all other enum values like "ok", "not_implemented", "unknown",
"healthy", "degraded", and "unhealthy" are lowercase. Either change the "Error"
value to "error" to match the lowercase convention used throughout the enum for
consistency, or if this capitalization is required for backwards compatibility
with Llama Stack provider responses, add explicit documentation to the
HealthStatus enum docstring explaining this dependency and why it differs from
the other values.
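
If the capitalized value has to stay, one option is to document it on the enum and normalize case where raw provider strings are parsed. The following is a minimal sketch only: it assumes the enum is a `str`-based `Enum` (the base class in src/models/common/health.py is not shown in this diff), it assumes `"Error"` mirrors what Llama Stack providers report, and `parse_status` is a hypothetical helper that does not exist in the PR.

```python
from enum import Enum


class HealthStatus(str, Enum):
    """Health status for providers and for the overall service.

    NOTE: ERROR is deliberately "Error" (capitalized). This sketch assumes
    the value mirrors the string reported by Llama Stack providers; confirm
    that before relying on it.
    """

    # Provider-level values
    OK = "ok"
    ERROR = "Error"
    NOT_IMPLEMENTED = "not_implemented"
    UNKNOWN = "unknown"
    # Service-level values
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


def parse_status(raw: str) -> HealthStatus:
    """Map a raw status string to the enum, ignoring case (hypothetical helper)."""
    for member in HealthStatus:
        if member.value.lower() == raw.lower():
            return member
    # Fall back to UNKNOWN when the string matches no member.
    return HealthStatus.UNKNOWN
```

Parsing case-insensitively keeps the wire value unchanged for existing clients while protecting internal comparisons from casing drift.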

"ImplicitOAuthFlow": {
"properties": {
"authorizationUrl": {
@@ -16888,7 +16903,7 @@
 "description": "Optional message about the health status",
 "examples": [
 "All systems operational",
-"Llama Stack is unavailable"
+"Provider is unavailable"
 ]
 }
 },
@@ -17866,7 +17881,7 @@
 "ready": {
 "type": "boolean",
 "title": "Ready",
-"description": "Flag indicating if service is ready",
+"description": "Flag indicating if service is ready to handle requests",
 "examples": [
 true,
 false
@@ -17875,34 +17890,78 @@
 "reason": {
 "type": "string",
 "title": "Reason",
-"description": "The reason for the readiness",
+"description": "The reason for the readiness status",
 "examples": [
 "Service is ready"
 ]
 },
+"overall_status": {
+"$ref": "#/components/schemas/HealthStatus",
+"description": "Overall service health status",
+"examples": [
+"healthy",
+"degraded",
+"unhealthy"
+]
+},
+"impacts": {
+"anyOf": [
+{
+"items": {
+"type": "string"
+},
+"type": "array"
+},
+{
+"type": "null"
+}
+],
+"title": "Impacts",
+"description": "List of functional impacts when service is degraded or unhealthy",
+"examples": [
+[
+"LLM inference unavailable",
+"RAG functionality unavailable",
+"Agent tools unavailable"
+]
+]
+},
 "providers": {
 "items": {
 "$ref": "#/components/schemas/ProviderHealthStatus"
 },
 "type": "array",
 "title": "Providers",
-"description": "List of unhealthy providers in case of readiness failure.",
+"description": "List of unhealthy providers (empty when all healthy)",
 "examples": []
 }
 },
 "type": "object",
 "required": [
 "ready",
 "reason",
+"overall_status",
 "providers"
 ],
 "title": "ReadinessResponse",
-"description": "Model representing response to a readiness request.\n\nAttributes:\n ready: If service is ready.\n reason: The reason for the readiness.\n providers: List of unhealthy providers in case of readiness failure.",
+"description": "Model representing response to a readiness request.\n\nAttributes:\n ready: If service is ready to handle requests.\n reason: The reason for the readiness status.\n overall_status: Overall service health status (healthy/degraded/unhealthy).\n impacts: Optional list of functional impacts when degraded or unhealthy.\n providers: List of unhealthy providers (empty when all healthy).",
 "examples": [
 {
+"overall_status": "healthy",
+"providers": [],
+"ready": true,
+"reason": "All providers are healthy"
+},
+{
+"impacts": [
+"LLM inference unavailable",
+"RAG functionality unavailable",
+"Agent tools unavailable"
+],
+"overall_status": "degraded",
 "providers": [],
 "ready": true,
-"reason": "Service is ready"
+"reason": "Service running in degraded mode"
 }
 ]
 },
93 changes: 72 additions & 21 deletions src/app/endpoints/health.py
@@ -26,8 +26,12 @@
     LivenessResponse,
     ReadinessResponse,
 )
-from models.common import HealthStatus, ProviderHealthStatus
+from models.common import (
+    HealthStatus,
+    ProviderHealthStatus,
+)
 from models.config import Action
+from utils.degraded_mode import DegradedModeTracker

 logger = get_logger(__name__)
 router = APIRouter(tags=["health"])
@@ -117,11 +121,11 @@ async def readiness_probe_get_method(
     response: Response,
 ) -> ReadinessResponse:
     """
-    Handle the readiness probe endpoint, returning service readiness.
+    Handle the readiness probe endpoint, returning service readiness and health status.

-    If any provider reports an error status, responds with HTTP 503
-    and details of unhealthy providers; otherwise, indicates the
-    service is ready.
+    Returns comprehensive health information including overall service status,
+    provider health, and functional impacts. The service is considered "ready" even
+    in degraded mode (returns 200), but reports reduced functionality.

     ### Parameters:
     - response: The outgoing HTTP response (used by middleware).
@@ -130,47 +134,94 @@ async def readiness_probe_get_method(
     ### Raises:
     - HTTPException: with status 401 for unauthorized access.
     - HTTPException: with status 403 if permission is denied.
-    - HTTPException: with status 500 and a detail object containing `response`
-      and `cause` when service configuration is wrong or incomplete.
-    - HTTPException: with status 503 and a detail object containing `response`
-      and `cause` when unable to connect to Llama Stack.
+    - HTTPException: with status 503 when service is unhealthy (providers down,
+      models unavailable) and degraded mode is not enabled.

     ### Returns:
-    - ReadinessResponse: Object with `ready` indicating overall readiness,
-      `reason` explaining the outcome, and `providers` containing the list of
-      unhealthy ProviderHealthStatus entries (empty when ready).
+    - ReadinessResponse: Object with comprehensive health status including:
+      - ready: True if service can handle requests (even in degraded mode)
+      - reason: Description of service state
+      - overall_status: healthy, degraded, or unhealthy
+      - impacts: Functional limitations when degraded/unhealthy
+      - providers: List of unhealthy providers
     """
     # Used only for authorization
     _ = auth

-    logger.info("Response to /v1/readiness endpoint")
+    logger.info("Response to /readiness endpoint")

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use debug() for the readiness probe trace log.

/readiness is hit frequently by orchestrators, so emitting this at info will create noisy logs and reduce operational signal quality.

As per coding guidelines, "Use standard log levels with clear purposes: debug() for diagnostic info, info() for program execution, warning() for unexpected events, error() for serious problems."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/health.py` at line 151, Change the logger.info() call that
logs "Response to /readiness endpoint" to logger.debug() instead, since the
readiness probe is called frequently by orchestrators and logging at info level
creates excessive noise in the logs. Per coding guidelines, debug() should be
used for diagnostic information while info() should be reserved for significant
program execution events.

Source: Coding guidelines
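
To illustrate the effect of the suggested level change, here is a small sketch using only the standard `logging` module. The logger name `health_sketch` and the function `log_readiness_hit` are made up for the example; the project obtains its logger through `get_logger(__name__)`, whose configuration is not shown in this diff.

```python
import logging

# Hypothetical logger name; the real module uses get_logger(__name__).
logger = logging.getLogger("health_sketch")


def log_readiness_hit() -> None:
    """Trace a readiness probe at debug level so routine probes stay quiet."""
    logger.debug("Response to /readiness endpoint")
```

With the logger at `INFO`, the probe trace is dropped; lowering the level to `DEBUG` brings it back when diagnosing probe behavior.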


-    provider_statuses = await get_providers_health_statuses()
+    degraded_tracker = DegradedModeTracker()
+    is_degraded = degraded_tracker.is_degraded()
+
+    # Determine overall status
+    if is_degraded:
+        # Service is ready (can serve health checks, metrics, etc.) but degraded
+        impacts = [
+            "LLM inference unavailable",
+            "RAG functionality unavailable",
+            "Agent tools unavailable",
+        ]
Comment on lines +159 to +163

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Extract readiness impact strings to shared constants.

These impact messages are duplicated across branches. Centralizing them avoids message drift between degraded/unhealthy responses and keeps API text consistent.

As per coding guidelines, "Check constants.py for shared constants before defining new ones."

Also applies to: 186-189, 214-214

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/endpoints/health.py` around lines 159 - 163, The impact message
strings ("LLM inference unavailable", "RAG functionality unavailable", "Agent
tools unavailable") are duplicated across multiple locations in the health.py
file (at lines 159-163, 186-189, and 214-214). Extract these string literals to
shared constants in constants.py (following the coding guideline to check
constants.py before defining new ones), then replace all hardcoded occurrences
in the health endpoints with references to these constants to ensure consistency
and prevent message drift between degraded and unhealthy responses.

Source: Coding guidelines
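
A possible shape for the extracted constants is sketched below. Every constant name and the `degraded_impacts` helper are hypothetical; the diff does not show what constants.py already contains, so existing names there should take precedence.

```python
# Hypothetical additions to constants.py; names are illustrative only.
IMPACT_LLM_INFERENCE_UNAVAILABLE = "LLM inference unavailable"
IMPACT_RAG_UNAVAILABLE = "RAG functionality unavailable"
IMPACT_AGENT_TOOLS_UNAVAILABLE = "Agent tools unavailable"
IMPACT_PROVIDER_CHECKS_UNAVAILABLE = "Provider health checks unavailable"

# Tuples keep the shared defaults immutable; callers copy them into lists.
DEGRADED_MODE_IMPACTS = (
    IMPACT_LLM_INFERENCE_UNAVAILABLE,
    IMPACT_RAG_UNAVAILABLE,
    IMPACT_AGENT_TOOLS_UNAVAILABLE,
)
CONNECTION_ERROR_IMPACTS = (
    IMPACT_LLM_INFERENCE_UNAVAILABLE,
    IMPACT_PROVIDER_CHECKS_UNAVAILABLE,
)


def degraded_impacts() -> list[str]:
    """Return a fresh list so one response cannot mutate the shared default."""
    return list(DEGRADED_MODE_IMPACTS)
```

The endpoint branches would then pass `degraded_impacts()` or `list(CONNECTION_ERROR_IMPACTS)` instead of repeating string literals.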

+        return ReadinessResponse(
+            ready=True,
+            reason="Service running in degraded mode",
+            overall_status=HealthStatus.DEGRADED,
+            impacts=impacts,
+            providers=[],
+        )

-    # Check if any provider is unhealthy (not counting not_implemented as unhealthy)
+    # Not in degraded mode - check provider health
+    provider_statuses = await get_providers_health_statuses()
     unhealthy_providers = [
         p for p in provider_statuses if p.status == HealthStatus.ERROR.value
     ]

     if unhealthy_providers:
-        ready = False
-        unhealthy_provider_names = [p.provider_id for p in unhealthy_providers]
-        reason = f"Providers not healthy: {', '.join(unhealthy_provider_names)}"
+        # Check if this is a connection error (provider_id="unknown")
+        is_connection_error = any(
+            p.provider_id == "unknown" for p in unhealthy_providers
+        )
+
+        if is_connection_error:
+            reason = "Cannot connect to backend service"
+            impacts = [
+                "LLM inference unavailable",
+                "Provider health checks unavailable",
+            ]
+        else:
+            unhealthy_provider_names = [p.provider_id for p in unhealthy_providers]
+            reason = f"Providers not healthy: {', '.join(unhealthy_provider_names)}"
+            impacts = [
+                f"Provider {p.provider_id}: {p.message}" for p in unhealthy_providers
+            ]
+
         response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
         return ReadinessResponse(
-            ready=ready, reason=reason, providers=unhealthy_providers
+            ready=False,
+            reason=reason,
+            overall_status=HealthStatus.UNHEALTHY,
+            impacts=impacts,
+            providers=unhealthy_providers if not is_connection_error else [],
         )

     # Check that the default model is registered in the model registry
     model_available, model_reason = await check_default_model_available()
     if not model_available:
         response.status_code = status.HTTP_503_SERVICE_UNAVAILABLE
         return ReadinessResponse(
-            ready=False, reason=model_reason, providers=unhealthy_providers
+            ready=False,
+            reason=model_reason,
+            overall_status=HealthStatus.UNHEALTHY,
+            impacts=["Default model not available in registry"],
+            providers=[],
         )

+    # All healthy
     return ReadinessResponse(
-        ready=True, reason="All providers are healthy", providers=unhealthy_providers
+        ready=True,
+        reason="All providers are healthy",
+        overall_status=HealthStatus.HEALTHY,
+        impacts=None,
+        providers=[],
     )
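
The branch order in this hunk (degraded mode first, then provider errors, then the default-model check) can be summarized as a pure function, which may also be easier to unit-test than the endpoint itself. This is a sketch under assumptions, not code from the PR: `Provider` is a minimal stand-in for `ProviderHealthStatus`, `classify_readiness` is a hypothetical name, and plain strings replace the `HealthStatus` members.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Provider:
    """Minimal stand-in for ProviderHealthStatus (only the fields used here)."""

    provider_id: str
    status: str
    message: str = ""


def classify_readiness(
    is_degraded: bool,
    providers: List[Provider],
    model_available: bool,
) -> Tuple[int, bool, str, Optional[List[str]]]:
    """Return (http_status, ready, overall_status, impacts) for a readiness probe."""
    if is_degraded:
        # Degraded mode still answers 200: the service is up with reduced features.
        return 200, True, "degraded", [
            "LLM inference unavailable",
            "RAG functionality unavailable",
            "Agent tools unavailable",
        ]

    unhealthy = [p for p in providers if p.status == "Error"]
    if unhealthy:
        if any(p.provider_id == "unknown" for p in unhealthy):
            # provider_id "unknown" marks a failed connection to the backend.
            impacts = [
                "LLM inference unavailable",
                "Provider health checks unavailable",
            ]
        else:
            impacts = [f"Provider {p.provider_id}: {p.message}" for p in unhealthy]
        return 503, False, "unhealthy", impacts

    if not model_available:
        return 503, False, "unhealthy", ["Default model not available in registry"]

    return 200, True, "healthy", None
```

Note that the degraded branch returns before any provider is queried, so provider errors are never reported while the tracker is in degraded mode.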


8 changes: 7 additions & 1 deletion src/app/main.py
@@ -26,6 +26,7 @@
 from models.api.responses.error import InternalServerErrorResponse
 from sentry import initialize_sentry
 from utils.common import register_mcp_servers_async
+from utils.degraded_mode import DegradedModeTracker
 from utils.llama_stack_version import check_llama_stack_version

 logger = get_logger(__name__)
@@ -81,15 +82,19 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
     await AsyncLlamaStackClientHolder().load(llama_stack_config)
     client: AsyncLlamaStackClient = AsyncLlamaStackClientHolder().get_client()
     logger.debug("Llama Stack client initialized, trying to connect to Llama Stack")
-    # check if the Llama Stack version is supported by the service
+    # Check connectivity to Llama Stack and set degraded mode if unavailable
+    degraded_tracker = DegradedModeTracker()
     try:
         llama_stack_version = await check_llama_stack_version(
             client, llama_stack_config.max_retries, llama_stack_config.retry_delay
         )
         if llama_stack_version is None:
             logger.error("Cannot retrieve Llama Stack version, check connection")
+            if llama_stack_config.allow_degraded_mode:
+                degraded_tracker.set_degraded("Llama Stack connection check failed")
         else:
Comment on lines 91 to 95

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail startup when connectivity check fails and degraded mode is disabled.

If version retrieval fails in this branch, startup currently continues whether or not degraded mode is enabled; the flag only controls whether the degraded state is recorded. That lets the service run in an unintended unhealthy state when degraded mode is explicitly off.

Suggested patch
         if llama_stack_version is None:
             logger.error("Cannot retrieve Llama Stack version, check connection")
             if llama_stack_config.allow_degraded_mode:
                 degraded_tracker.set_degraded("Llama Stack connection check failed")
+            else:
+                raise RuntimeError(
+                    "Cannot retrieve Llama Stack version while degraded mode is disabled"
+                )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/app/main.py` around lines 91 - 95, When llama_stack_version is None, the
code currently only sets degraded mode if allow_degraded_mode is True, but fails
to handle the case where allow_degraded_mode is False. Add an else clause after
the degraded_tracker.set_degraded call to fail the startup process (using
sys.exit or raising an exception) when allow_degraded_mode is False, ensuring
that startup terminates when the connectivity check fails and degraded mode is
explicitly disabled.
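
The intended startup behavior can be sketched in isolation as follows. The `DegradedModeTracker` below is a simplified stand-in written for this example (the real class in utils/degraded_mode.py is not shown in the diff and appears to share state across instances), and `apply_version_check` is a hypothetical helper, not a function in the PR.

```python
from typing import Optional


class DegradedModeTracker:
    """Simplified stand-in for utils.degraded_mode.DegradedModeTracker."""

    def __init__(self) -> None:
        self.reason: Optional[str] = None

    def set_degraded(self, reason: str) -> None:
        self.reason = reason

    def set_healthy(self) -> None:
        self.reason = None

    def is_degraded(self) -> bool:
        return self.reason is not None


def apply_version_check(
    version: Optional[str],
    allow_degraded_mode: bool,
    tracker: DegradedModeTracker,
) -> None:
    """Record the outcome of the version check, failing fast when required."""
    if version is None:
        if allow_degraded_mode:
            tracker.set_degraded("Llama Stack connection check failed")
            return
        # Degraded mode is off: refuse to start without a reachable backend.
        raise RuntimeError(
            "Cannot retrieve Llama Stack version while degraded mode is disabled"
        )
    tracker.set_healthy()
```

Raising inside the lifespan handler aborts application startup, which matches how the existing `APIConnectionError` branch re-raises when degraded mode is not allowed.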

             logger.debug("Llama Stack version: %s", llama_stack_version)
+            degraded_tracker.set_healthy()
     except APIConnectionError as e:
         # if degraded mode is allowed, simply ignore the exception
         llama_stack_url = llama_stack_config.url
@@ -103,6 +108,7 @@ async def lifespan(_app: FastAPI) -> AsyncIterator[None]:
         )
         if llama_stack_config.allow_degraded_mode:
             logger.info("Entering degraded mode: LCORE running w/o Llama Stack")
+            degraded_tracker.set_degraded(f"Failed to connect to Llama Stack: {e!s}")
         else:
             raise
