Agentic AI - Monitoring/Observability/Alerting
To check for app-server errors and monitor the performance of your autonomous SRE agent, you can use specific Prometheus queries for the infrastructure and Arize Phoenix for the AI's internal logic.
- Prometheus Queries for app-server
Since your app.py exposes a counter called http_requests_total with a status label, you can use these PromQL (Prometheus Query Language) expressions:
Current Error Rate Percentage: This is the most critical SRE metric. It calculates the ratio of failed requests (5xx) to total requests over the last 5 minutes.
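A plausible expression for that ratio, assuming the status label carries three-digit HTTP codes (the exact label values depend on how app.py records them), would be:

```promql
# 5xx requests as a percentage of all requests over the last 5 minutes
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
```

rate() computes the per-second increase of the counter over the 5-minute window, so restarts of app.py do not reset the ratio.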
- Things to Check in Arize Phoenix
While Prometheus monitors the server, Arize Phoenix monitors the AI Agent. When the agent runs an investigation, you should check the following in the Phoenix UI (http://localhost:6006):
Tool Spans: Look for the specific step where the agent called the Prometheus Query Tool. You can see the exact PromQL query the agent generated and the raw data it received back. This helps you verify if the AI is "hallucinating" queries or correctly interpreting the metrics.
LLM Spans: Inspect the "System Prompt" and "User Input" to see the context the agent was given. You can also check the latency and token usage of each decision step to optimize costs.
Execution Flow (The Trace): View the complete chain of thought, from the initial triage to the final Root Cause Analysis (RCA). Each row in a trace represents a "span" (a single operation).
Metrics Dashboard: Check the pre-defined dashboard in your project to see aggregate Error Rates for tool calls and LLM invocations. If the agent itself is failing (e.g., API timeouts), it will show up here.
Input/Output Validation: If the user gets a wrong answer, you can look at the retrieved context to see if the agent misread the Prometheus data or if it ignored a critical signal like a high error spike.
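Phoenix computes these aggregates for you in the UI; purely as an illustration of what "error rate for tool calls" means over a trace of spans, here is a minimal sketch (the Span fields and names are hypothetical, not the Phoenix schema):

```python
from dataclasses import dataclass

@dataclass
class Span:
    kind: str      # e.g. "TOOL" or "LLM" (hypothetical labels)
    name: str      # e.g. "prometheus_query"
    ok: bool       # did the operation succeed?

def tool_error_rate(trace: list[Span]) -> float:
    """Fraction of TOOL spans in a trace that failed."""
    tools = [s for s in trace if s.kind == "TOOL"]
    if not tools:
        return 0.0
    return sum(not s.ok for s in tools) / len(tools)

trace = [
    Span("LLM", "triage", True),
    Span("TOOL", "prometheus_query", True),
    Span("TOOL", "prometheus_query", False),  # e.g. an API timeout
    Span("LLM", "rca", True),
]
print(tool_error_rate(trace))  # 0.5
```

A trace where the agent itself is failing (timeouts, bad tool calls) shows up as a high value of exactly this kind of ratio on the dashboard.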
- Explanatory Summary for the POC
Progressive Exposure: The system avoids a "big bang" release. It uses the K8sRolloutTool to scale the Canary to only 1 node (33% of the 3-node cluster) initially.
Automated Validation: Instead of waiting for a human, the Traffic Analyst agent performs 5 probes. It interprets random HTTP 500 errors as a "Regression," effectively acting as a digital first-responder.
Autonomous Remediation: If the probes pass, the Release Engineer executes a "Promote" (scales v2 to 3, kills v1). If they fail, it executes an instant "Rollback" (kills v2).
Auditability: Every decision—from the initial replica count to the final rollback command—is recorded in Arize Phoenix. This ensures that even though the agent is autonomous, its reasoning is 100% transparent to human auditors.
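The probe-and-decide loop above reduces to a small pure function. As a hedged sketch of that control flow (the function names and command strings are illustrative, not the actual tool interface):

```python
def analyze_probes(status_codes: list[int]) -> str:
    """Traffic Analyst logic: any 5xx across the probes counts as a regression."""
    return "Regression" if any(code >= 500 for code in status_codes) else "Healthy"

def release_decision(verdict: str) -> list[str]:
    """Release Engineer logic: promote on healthy probes, otherwise roll back."""
    if verdict == "Healthy":
        # Promote: scale v2 to the full 3 nodes, then remove v1.
        return ["scale v2 --replicas=3", "delete v1"]
    # Rollback: remove the canary; v1 keeps serving.
    return ["delete v2"]

# Five probes, one random HTTP 500 -> instant rollback
print(release_decision(analyze_probes([200, 200, 500, 200, 200])))  # ['delete v2']
```

Keeping the decision a pure function of the probe results is also what makes the Phoenix audit trail meaningful: the recorded inputs fully determine the recorded action.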