LCORE-1857: Add metrics to track degraded mode startup state#1985
Conversation
Walkthrough
ChangesStartup metrics initialization and degraded mode gauge
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
✨ Simplify code
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/utils/degraded_mode.py`:
- Around line 35-44: The `set_healthy()` method is calling
`recording.set_started_in_degraded_mode(False)` which overwrites the
startup-state metric, erasing evidence that the service started degraded. Since
this metric should only capture the startup state and be write-once per process,
remove the line `recording.set_started_in_degraded_mode(False)` from the
`set_healthy()` method. The metric should only be set once during
initialization, not updated when transitioning between degraded and healthy
modes later.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 7902cec5-73e6-43fa-ab6a-79464359fa17
📒 Files selected for processing (6)
src/app/endpoints/metrics.pysrc/app/main.pysrc/metrics/__init__.pysrc/metrics/recording.pysrc/metrics/utils.pysrc/utils/degraded_mode.py
📜 Review details
⏰ Context from checks skipped due to timeout. (15)
- GitHub Check: Pylinter
- GitHub Check: unit_tests (3.13)
- GitHub Check: unit_tests (3.12)
- GitHub Check: build-pr
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E Tests for Lightspeed Evaluation job
🧰 Additional context used
📓 Path-based instructions (3)
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/metrics/__init__.pysrc/metrics/recording.pysrc/app/endpoints/metrics.pysrc/utils/degraded_mode.pysrc/metrics/utils.pysrc/app/main.py
src/**/__init__.py
📄 CodeRabbit inference engine (AGENTS.md)
Package
__init__.pyfiles must contain brief package descriptions
Files:
src/metrics/__init__.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/endpoints/metrics.pysrc/app/main.py
🧠 Learnings (1)
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/metrics.py
🔇 Additional comments (6)
src/metrics/__init__.py (1)
85-90: LGTM!src/metrics/recording.py (1)
162-174: LGTM!src/utils/degraded_mode.py (1)
9-9: LGTM!src/metrics/utils.py (1)
13-20: LGTM!src/app/main.py (1)
26-26: LGTM!Also applies to: 124-130
src/app/endpoints/metrics.py (1)
49-49: LGTM!Also applies to: 65-65
| # Record startup state metric | ||
| recording.set_started_in_degraded_mode(True) | ||
|
|
||
| def set_healthy(self) -> None: | ||
| """Mark the service as running in healthy mode.""" | ||
| self._is_degraded = False | ||
| self._degraded_reason = None | ||
|
|
||
| # Record startup state metric | ||
| recording.set_started_in_degraded_mode(False) |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Preserve startup-only semantics for ls_started_in_degraded_mode.
Line 43 resets the gauge to 0, which can erase evidence that the service started degraded if set_healthy() is called after startup. For a startup-state metric, this should be write-once per process startup.
💡 Suggested fix
class DegradedModeTracker(metaclass=Singleton):
@@
def __init__(self) -> None:
"""Initialize the degraded mode tracker."""
self._is_degraded: bool = False
self._degraded_reason: Optional[str] = None
+ self._startup_state_recorded: bool = False
+
+ def _record_startup_state_once(self, is_degraded: bool) -> None:
+ """Record startup degraded-state metric only once per process start."""
+ if self._startup_state_recorded:
+ return
+ recording.set_started_in_degraded_mode(is_degraded)
+ self._startup_state_recorded = True
@@
- recording.set_started_in_degraded_mode(True)
+ self._record_startup_state_once(True)
@@
- recording.set_started_in_degraded_mode(False)
+ self._record_startup_state_once(False)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/utils/degraded_mode.py` around lines 35 - 44, The `set_healthy()` method
is calling `recording.set_started_in_degraded_mode(False)` which overwrites the
startup-state metric, erasing evidence that the service started degraded. Since
this metric should only capture the startup state and be write-once per process,
remove the line `recording.set_started_in_degraded_mode(False)` from the
`set_healthy()` method. The metric should only be set once during
initialization, not updated when transitioning between degraded and healthy
modes later.
cee4a9a to
b169143
Compare
Add Prometheus metrics to monitor when Lightspeed Core Stack starts in degraded mode (without llama-stack connectivity). During implementation, discovered that PR lightspeed-core#327's lazy initialization pattern prevented model metrics from being set up correctly, requiring a revert. When LCORE starts in degraded mode, the @run_once_async decorator marks setup_model_metrics() as "done" after it returns early. On restart with llama-stack available, the function never runs again, leaving model metrics uninitialized. Moving setup back to startup fixes this while degraded mode already provides the startup resilience that PR lightspeed-core#327 originally addressed. Signed-off-by: Anik Bhattacharjee <anbhatta@redhat.com>
b169143 to
359b19e
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/utils/degraded_mode.py (1)
43-44: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winStartup-state gauge is reset on later healthy transitions.
set_healthy()writes0tols_started_in_degraded_mode, which can overwrite the recorded startup state if a healthy transition occurs after startup. For a startup-state metric this should be write-once per process.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/utils/degraded_mode.py` around lines 43 - 44, The startup-state gauge is being overwritten by later healthy transitions in set_healthy(), which makes ls_started_in_degraded_mode no longer reflect only process startup state. Update the degraded_mode logic so set_started_in_degraded_mode is written once during startup and not reset by later calls to set_healthy(); keep the startup recording in the startup path and remove or guard the zero-write in set_healthy() while preserving the behavior of the other degraded mode metrics.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/openapi.json`:
- Line 10403: The OpenAPI schema artifact is out of sync with the generated
runtime schema, causing the Spectral CI drift failure. Regenerate the schema
using the existing generator workflow that updates docs/openapi.json, then
commit the refreshed artifact so it matches the current API definition; the key
target to verify is the metrics endpoint description in the generated OpenAPI
output.
In `@src/metrics/recording.py`:
- Around line 168-169: The docstring for the relevant function in recording.py
uses the nonstandard Args: header; update it to the repo’s required Parameters:
section while keeping the existing is_degraded argument description intact.
Locate the affected docstring near the metrics recording function and replace
the header so it matches the standard used throughout the codebase.
---
Duplicate comments:
In `@src/utils/degraded_mode.py`:
- Around line 43-44: The startup-state gauge is being overwritten by later
healthy transitions in set_healthy(), which makes ls_started_in_degraded_mode no
longer reflect only process startup state. Update the degraded_mode logic so
set_started_in_degraded_mode is written once during startup and not reset by
later calls to set_healthy(); keep the startup recording in the startup path and
remove or guard the zero-write in set_healthy() while preserving the behavior of
the other degraded mode metrics.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: d9a69587-8ade-4368-af9e-305ac6f3550a
📒 Files selected for processing (8)
docs/openapi.jsonsrc/app/endpoints/metrics.pysrc/app/main.pysrc/metrics/__init__.pysrc/metrics/recording.pysrc/metrics/utils.pysrc/utils/degraded_mode.pytests/unit/app/endpoints/test_metrics.py
📜 Review details
⏰ Context from checks skipped due to timeout. (2)
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
🧰 Additional context used
📓 Path-based instructions (4)
tests/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
tests/**/*.py: Use pytest for all unit and integration tests; do not use unittest
Usepytest.mark.asynciomarker for async tests
Files:
tests/unit/app/endpoints/test_metrics.py
src/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/**/*.py: Use absolute imports for internal modules:from authentication import get_auth_dependency
Llama Stack imports: Usefrom llama_stack_client import AsyncLlamaStackClient
Checkconstants.pyfor shared constants before defining new ones
All modules must start with descriptive docstrings explaining purpose
Uselogger = get_logger(__name__)fromlog.pyfor module logging
All functions must have complete type annotations for parameters and return types, use modern syntax (str | int), and include descriptive docstrings
Use snake_case with descriptive, action-oriented names for functions (get_, validate_, check_)
Avoid in-place parameter modification anti-patterns; return new data structures instead of modifying function parameters
Useasync deffor I/O operations and external API calls
Use standard log levels with clear purposes:debug()for diagnostic info,info()for program execution,warning()for unexpected events,error()for serious problems
All classes must have descriptive docstrings explaining purpose and use PascalCase with standard suffixes:Configuration,Error/Exception,Resolver,Interface
Abstract classes must use ABC with@abstractmethoddecorators
Follow Google Python docstring conventions with required sections: Parameters, Returns, Raises, and Attributes for classes
Files:
src/metrics/recording.pysrc/app/endpoints/metrics.pysrc/utils/degraded_mode.pysrc/app/main.pysrc/metrics/__init__.pysrc/metrics/utils.py
src/app/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
src/app/**/*.py: FastAPI dependencies: Import fromfastapimodule forAPIRouter,HTTPException,Request,status,Depends
Use FastAPIHTTPExceptionwith appropriate status codes for API endpoints and handleAPIConnectionErrorfrom Llama Stack
Files:
src/app/endpoints/metrics.pysrc/app/main.py
src/**/__init__.py
📄 CodeRabbit inference engine (AGENTS.md)
Package
__init__.pyfiles must contain brief package descriptions
Files:
src/metrics/__init__.py
🧠 Learnings (2)
📚 Learning: 2026-06-24T13:45:37.249Z
Learnt from: Jdubrick
Repo: lightspeed-core/lightspeed-stack PR: 1971
File: src/utils/markdown_repair.py:31-36
Timestamp: 2026-06-24T13:45:37.249Z
Learning: In the lightspeed-stack repository, docstrings must use the section header name "Parameters:" (not "Args:") for function arguments, even if the project references Google Python docstring conventions. Ensure docstrings follow the project’s established "Parameters:" header format for any documented function parameters.
Applied to files:
tests/unit/app/endpoints/test_metrics.pysrc/metrics/recording.pysrc/app/endpoints/metrics.pysrc/utils/degraded_mode.pysrc/app/main.pysrc/metrics/__init__.pysrc/metrics/utils.py
📚 Learning: 2026-04-06T20:18:07.852Z
Learnt from: major
Repo: lightspeed-core/lightspeed-stack PR: 1463
File: src/app/endpoints/rlsapi_v1.py:266-271
Timestamp: 2026-04-06T20:18:07.852Z
Learning: In the lightspeed-stack codebase, within `src/app/endpoints/` inference/MCP endpoints, treat `tools: Optional[list[Any]]` in MCP tool definitions as an intentional, consistent typing pattern (used across `query`, `responses`, `streaming_query`, `rlsapi_v1`). Do not raise or suggest this as a typing issue during code review; changing it in isolation could break endpoint typing consistency across the codebase.
Applied to files:
src/app/endpoints/metrics.py
🪛 GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt
docs/openapi.json
[error] 1-1: docs/openapi.json is out of date (generated schema differs from /tmp/openapi-generated.json). Regenerate with: uv run python scripts/generate_openapi_schema.py docs/openapi.json
🔇 Additional comments (5)
src/metrics/__init__.py (1)
86-90: LGTM!src/metrics/utils.py (1)
12-20: LGTM!src/app/endpoints/metrics.py (1)
64-65: LGTM!src/app/main.py (1)
126-129: 🩺 Stability & AvailabilityBroaden startup error handling if model-metrics setup should not abort boot
setup_model_metrics()can still raiseAPIStatusErrorfrommodels.list()orHTTPExceptionfromcheck_configuration_loaded(), but the lifespan only catchesAPIConnectionError. If healthy-mode startup should tolerate those failures too, handle them here; otherwise this fail-fast behavior is fine.tests/unit/app/endpoints/test_metrics.py (1)
7-7: LGTM!Also applies to: 43-43
| ], | ||
| "summary": "Metrics Endpoint Handler", | ||
| "description": "Handle request to the /metrics endpoint.\n\nProcess GET requests to the /metrics endpoint, returning the\nlatest Prometheus metrics in form of a plain text.\n\nInitializes model metrics on the first request if not already\nset up, then responds with the current metrics snapshot in\nPrometheus format.\n\n### Parameters:\n- request: The incoming HTTP request (used by middleware).\n- auth: Authentication tuple from the auth dependency (used by middleware).\n\n### Returns:\n- PlainTextResponse: Response body containing the Prometheus metrics text\n and the Prometheus content type.", | ||
| "description": "Handle request to the /metrics endpoint.\n\nProcess GET requests to the /metrics endpoint, returning the\nlatest Prometheus metrics in plain text Prometheus format.\n\n### Parameters:\n- request: The incoming HTTP request (used by middleware).\n- auth: Authentication tuple from the auth dependency (used by middleware).\n\n### Returns:\n- PlainTextResponse: Response body containing the Prometheus metrics text\n and the Prometheus content type.", |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Regenerate and commit OpenAPI schema to fix CI drift.
The Spectral job is failing because docs/openapi.json no longer matches the generated schema. Please regenerate with uv run python scripts/generate_openapi_schema.py docs/openapi.json and commit the updated artifact to keep docs and runtime schema in sync.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@docs/openapi.json` at line 10403, The OpenAPI schema artifact is out of sync
with the generated runtime schema, causing the Spectral CI drift failure.
Regenerate the schema using the existing generator workflow that updates
docs/openapi.json, then commit the refreshed artifact so it matches the current
API definition; the key target to verify is the metrics endpoint description in
the generated OpenAPI output.
Source: Pipeline failures
| Args: | ||
| is_degraded: True if service started in degraded mode, False if healthy. |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Use the Parameters: docstring header instead of Args:.
This repo standardizes on the Parameters: section header for documented arguments.
📝 Proposed fix
- Args:
- is_degraded: True if service started in degraded mode, False if healthy.
+ Parameters:
+ is_degraded: True if service started in degraded mode, False if healthy.Based on learnings: docstrings must use the section header name "Parameters:" (not "Args:") for function arguments.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| Args: | |
| is_degraded: True if service started in degraded mode, False if healthy. | |
| Parameters: | |
| is_degraded: True if service started in degraded mode, False if healthy. |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/metrics/recording.py` around lines 168 - 169, The docstring for the
relevant function in recording.py uses the nonstandard Args: header; update it
to the repo’s required Parameters: section while keeping the existing
is_degraded argument description intact. Locate the affected docstring near the
metrics recording function and replace the header so it matches the standard
used throughout the codebase.
Source: Learnings
Description
Add Prometheus metrics to monitor when Lightspeed Core Stack starts in degraded mode (without llama-stack connectivity). During implementation, discovered that PR #327's lazy initialization pattern prevented model metrics from being set up correctly, requiring a revert.
When LCORE starts in degraded mode, the @run_once_async decorator marks setup_model_metrics() as "done" after it returns early. On restart with llama-stack available, the function never runs again, leaving model metrics uninitialized. Moving setup back to startup fixes this while degraded mode already provides the startup resilience that PR #327 originally addressed.
Type of change
Tools used to create PR
Identify any AI code assistants used in this PR (for transparency and review context)
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
Release Notes
New Features
ls_started_in_degraded_mode) to indicate whether the service started in degraded or healthy mode.Improvements
/metricsrequest./metricsnow consistently returns the latest Prometheus metrics in plain text format.Tests / Documentation