Skip to content

feat: add model health monitoring and trace context logging#58

Closed
vlordier wants to merge 1 commit into
Liquid4All:mainfrom
vlordier:feat/health-monitoring-trace-context
Closed

feat: add model health monitoring and trace context logging#58
vlordier wants to merge 1 commit into
Liquid4All:mainfrom
vlordier:feat/health-monitoring-trace-context

Conversation

@vlordier
Copy link
Copy Markdown

@vlordier vlordier commented Mar 6, 2026

Add ModelStatus struct, get_status method, and trace_id logging

1. Model Health Check + Auto-Recovery:
   - Add ModelStatus struct with key, display_name, base_url, healthy, model_name, error
   - Add InferenceClient::get_status() method for detailed health status
   - Health check hits /models endpoint to verify connectivity

2. Parallel MCP Server Startup:
   - Already implemented in lifecycle.rs using tokio::spawn
   - All servers start concurrently

3. Structured Logging with Trace Context:
   - Add trace_id (8-char UUID) to each send_message request
   - Log trace_id at start of message processing
   - Enables correlation across logs for debugging

Testing: 365 tests pass, clippy clean
Copilot AI review requested due to automatic review settings March 6, 2026 20:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds health/status reporting for the currently selected inference model and introduces per-request trace ID logging to improve observability in the LocalCowork Tauri app.

Changes:

  • Adds a ModelStatus struct for serializable model health/status reporting.
  • Introduces InferenceClient::get_status() to return endpoint + health details.
  • Logs a generated trace_id at the start of send_message() for log correlation.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
examples/localcowork/src-tauri/src/inference/types.rs Adds ModelStatus type (serde-serializable) for health monitoring consumers.
examples/localcowork/src-tauri/src/inference/client.rs Adds get_status() API and extends tests with an LM Studio model config case.
examples/localcowork/src-tauri/src/commands/chat.rs Generates and logs a trace_id at the beginning of message processing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +429 to +432
pub async fn get_status(&self) -> super::types::ModelStatus {
let url = format!("{}/models", self.current_model.base_url);
match self.http.get(&url).timeout(CONNECT_TIMEOUT).send().await {
Ok(resp) if resp.status().is_success() => super::types::ModelStatus {
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This uses CONNECT_TIMEOUT (documented as a TCP connect timeout) as a per-request total timeout via .timeout(CONNECT_TIMEOUT). That’s confusing and easy to misuse because reqwest::RequestBuilder::timeout applies to the entire request/response. Consider introducing a dedicated short HEALTH_CHECK_TIMEOUT (or similar) for health/status calls, or updating the constant name/doc to match its actual use here.

Copilot uses AI. Check for mistakes.
Comment on lines +1244 to +1247
let trace_id = uuid::Uuid::new_v4().to_string()[..8].to_string();

tracing::info!(trace_id = %trace_id, session_id = %session_id, content_len = content.len(), "starting message processing");

Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trace_id is generated but only attached to a single log record. If the goal is request-level correlation across subsequent logs, consider creating a tracing span that includes trace_id (and entering it for the rest of the function) so downstream tracing::info!/debug! calls inherit the same context. Also, truncating to 8 hex chars (32 bits) increases the chance of collisions in logs; using the full UUID or a longer prefix would reduce that risk.

Suggested change
let trace_id = uuid::Uuid::new_v4().to_string()[..8].to_string();
tracing::info!(trace_id = %trace_id, session_id = %session_id, content_len = content.len(), "starting message processing");
let trace_id = Uuid::new_v4().to_string();
// Create a span so all subsequent logs inherit the same trace_id/session_id context.
let span = tracing::info_span!("send_message", trace_id = %trace_id, session_id = %session_id);
let _span_guard = span.enter();
tracing::info!(content_len = content.len(), "starting message processing");

Copilot uses AI. Check for mistakes.
Comment on lines +440 to +446
Ok(resp) => super::types::ModelStatus {
key: self.current_model_key.clone(),
display_name: self.current_model.display_name.clone(),
base_url: self.current_model.base_url.clone(),
healthy: false,
model_name: None,
error: Some(format!("HTTP {}", resp.status())),
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_status sets model_name: None for non-2xx and transport errors, even though the client already knows the configured model name/key. This makes the status payload less useful for UIs/health dashboards because you lose which model is unhealthy. Consider always populating model_name (e.g., from current_model.model_name falling back to current_model_key) and reserve error/healthy for the health signal.

Copilot uses AI. Check for mistakes.
vlordier added a commit to vlordier/cookbook that referenced this pull request Mar 6, 2026
- Add 3 attempts with exponential backoff to health_check()
- Improves reliability of PR Liquid4All#58 health monitoring by handling transient failures

This helps avoid false negatives when model server is temporarily slow.
@Paulescu
Copy link
Copy Markdown
Collaborator

Closing this one. The three changes (ModelStatus, get_status, trace_id) each land as a primitive with no consumer, so nothing in the app actually exercises them.

get_status is dead code. No Tauri command, chat handler, or orchestrator path calls it. ModelStatus is referenced only as the return type of the method that nobody invokes. For a "health monitoring" feature, the integration is the point — a Tauri command like get_model_status that the frontend calls, ideally with a polling story or an event when health changes.

trace_id is logged once and never propagated. send_message generates a trace_id and emits a single tracing::info! with it, then the rest of the function logs without the field. There's nothing to correlate. The idiomatic shape is either #[tracing::instrument(fields(trace_id))] on send_message or let _span = tracing::info_span!("send_message", trace_id = %trace_id).entered(); at the top of the body — both attach the field to every event in the span automatically.

Hardcodes /models instead of using runtime.health_check. _models/config.yaml declares per-runtime health endpoints (Ollama: /api/tags, llama_cpp: /health, lmstudio: /v1/models). get_status unconditionally hits {base_url}/models, which 404s on Ollama. Read the configured health_check URL for the runtime, fall back to /v1/models only if absent.

Smuggled content from #53. The lmstudio-model test fixture and test_lmstudio_model_config test in inference/client.rs are the same lines added by #53. Not blocking on their own, but the partial overlap will conflict on rebase. Once #53 is resolved, drop those hunks from this branch.

Smaller nits:

  • uuid::Uuid::new_v4().to_string()[..8].to_string() is fragile and gives 32 bits of entropy. If short IDs are wanted, use a counter or nanoid. The [..8] is ASCII-safe only because of UUID's hex format.
  • The multi-line assert! reformatting in types.rs is unrelated rustfmt churn.
  • PR description is one sentence with no motivation or integration plan.

If you want to resubmit, the rule of thumb is: ship the primitive and its first consumer in the same PR. Add get_status, expose it as get_model_status Tauri command, and wire the frontend (or at least an integration point) that calls it. Add trace_id propagation via a span on the request handler. With those, this becomes a reviewable feature.

@Paulescu Paulescu closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants