Summary
Add a suite of MCP Apps to observability-stack: MCP tools that return an interactive UI inline in MCP-compatible hosts (Claude Desktop, Claude.ai, VS Code Copilot, Cursor, Kiro) alongside the compact text summary the model reasons over. The Apps are served by a single MCP server for the whole observability stack rather than one server per surface, so shared capabilities (authentication, time and data routing, correlation, save-and-share, inspectability) are defined once at the server and inherited by every App.
The surface is derived from a jobs-to-be-done decomposition of the existing console: 295 jobs across 15 families, served by ~51 composable Apps over ~285 shared UI components. Each job is independently shippable, so this lands incrementally rather than as a single release.
Motivation
Today the observability-stack MCP server returns text and JSON. An agent calls a tool, summarizes the result for the human, and the human either trusts the summary or leaves the conversation to verify it in OpenSearch Dashboards. The summary is lossy by definition: a thousand-span trace becomes a sentence about how long the agent ran, a latency histogram becomes "p99 is high." Two things break at once. The human loses any way to verify the claim in place, and in an autonomous loop the model loses the deterministic signal it needs to choose its next step.
We need a surface that:
- Lets a single agent move from a firing alert, to the trace that explains it, to the logs that span emitted, to the dashboard that tracks the trend, in one conversation, without losing context (workspace, time range, traceId, auth token).
- Returns a deterministic UI for the human to verify next to the agent's reasoning, from the same tool call, without a tab switch or re-authentication.
- Works across MCP-spec-compliant hosts with no host-specific assumptions.
- Keeps everything else unchanged: data stays in OpenSearch; PPL for logs and traces, PromQL for Prometheus, the OTel pipeline (OTel Collector, Data Prepper), SLOs, alerting rules, and dashboards all survive; and the console remains for users who prefer it.
Proposal
Architecture: one MCP server, many Apps
MCP host (Claude Desktop / Claude.ai / VS Code Copilot / Cursor / Kiro)
│
│ MCP (stdio or HTTP)
▼
OpenSearch Observability MCP App server ──► data access via opensearch-mcp-server-py
├─ model-facing tools → return { text summary, UI resource }
├─ app-only tools → called by the UI through the host bridge for fresh data
└─ shared server capabilities (inherited by every App):
auth · time/data routing · correlation · save-and-share · inspectability
│
▼
OpenSearch (logs, traces, service map) · Prometheus / Amazon Managed Prometheus (metrics)
- Each App is a model-facing MCP tool whose result carries both a compact text summary and a self-contained UI resource the host renders in a sandboxed iframe.
- The UI can call app-only tools through the host bridge to fetch fresh data or persist state, without a separate API layer or auth plumbing.
- Capabilities are defined once at the server. Every App that joins the catalog inherits them, and every job written against the catalog gets them for free.
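The split between model-facing and app-only tools, and the way every App inherits server-level context, can be sketched roughly as follows. This is a minimal illustration, not the server's actual API: `SharedContext`, `Tool`, `AppServer`, the tool names, and the `ui://alerts` identifier are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedContext:
    """Capabilities defined once at the server (hypothetical shape)."""
    workspace: str
    time_range: Tuple[str, str]  # (start, end)
    auth_mode: str = "none"


@dataclass
class Tool:
    name: str
    handler: Callable[[SharedContext, dict], dict]
    model_visible: bool  # False means app-only: reachable only through the host bridge


class AppServer:
    def __init__(self, ctx: SharedContext) -> None:
        self.ctx = ctx
        self.tools: Dict[str, Tool] = {}

    def register(self, name: str, handler, model_visible: bool = True) -> None:
        self.tools[name] = Tool(name, handler, model_visible)

    def list_model_tools(self) -> List[str]:
        # Only model-facing tools are advertised to the model.
        return sorted(t.name for t in self.tools.values() if t.model_visible)

    def call(self, name: str, args: dict) -> dict:
        # Every handler receives the same shared context, so no App
        # re-implements workspace, time range, or auth handling.
        return self.tools[name].handler(self.ctx, args)


def list_alerts(ctx: SharedContext, args: dict) -> dict:
    return {"text": f"0 alerts firing in {ctx.workspace}", "ui": "ui://alerts"}


def refresh_alerts(ctx: SharedContext, args: dict) -> dict:
    return {"rows": [], "workspace": ctx.workspace}


server = AppServer(SharedContext("prod", ("now-1h", "now"), "bearer"))
server.register("alerts.list", list_alerts)
server.register("alerts.refresh", refresh_alerts, model_visible=False)
```

In this sketch the model only ever sees `alerts.list`, while the rendered widget can still call `alerts.refresh` and gets the same workspace and auth context without extra plumbing.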
Tool result contract (two readers)
A tool result serves two audiences from one return value:
- Model: structured text (counts, top offenders, the fields needed to decide the next call). Cheap to reason over, lossless on the decision-driving facts.
- Human: the interactive widget (alert table, trace tree, latency chart, burn-rate gauge) rendered inline.
Design rule learned from evaluation (see Validation): aggregate before you drill in. On error hunts the agent must see the failure-mode distribution before any single trace, and that distribution must appear in both the tool text and a widget card, with guidance to drill into the dominant bucket rather than the slowest trace. Surfacing the aggregate first measurably improved root-cause accuracy.
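A minimal sketch of the two-reader contract combined with the aggregate-before-drill-in rule might look like the following. The `ToolResult` shape, the `failure_mode` field, and the sample traces are assumptions made for illustration; the real contract is defined by the server.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToolResult:
    text: str         # compact summary the model reasons over
    ui_payload: Dict  # data the widget renders for the human


def summarize_error_traces(traces: List[Dict]) -> ToolResult:
    """Aggregate failure modes first, then point at the dominant bucket."""
    ranked = Counter(t["failure_mode"] for t in traces).most_common()
    if not ranked:
        return ToolResult("No error traces in window.",
                          {"distribution": [], "traces": []})
    dominant, count = ranked[0]
    lines = [f"{len(traces)} error traces, {len(ranked)} failure modes:"]
    lines += [f"- {mode}: {n}" for mode, n in ranked]
    lines.append(
        f"Drill into '{dominant}' ({count} traces) before the slowest trace."
    )
    # The same distribution goes to both readers: text for the model,
    # a widget card for the human.
    payload = {
        "distribution": [{"mode": m, "count": n} for m, n in ranked],
        "traces": traces,
    }
    return ToolResult("\n".join(lines), payload)


# Hypothetical sample data.
traces = [
    {"trace_id": "a1", "failure_mode": "DEADLINE_EXCEEDED", "duration_ms": 600000},
    {"trace_id": "b2", "failure_mode": "DEADLINE_EXCEEDED", "duration_ms": 1200},
    {"trace_id": "c3", "failure_mode": "unique_violation", "duration_ms": 90},
]
result = summarize_error_traces(traces)
```

The point of the sketch is ordering: the distribution is computed and stated before any individual trace is surfaced, so neither reader is steered toward the slowest trace by default.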
Job catalog and scope (JTBD)
| Family | Jobs | Outcome |
| --- | --- | --- |
| Triage & Response | 20 | Scan what is firing, narrow, read the backing monitor, ack/mute/assign, unify OpenSearch Alerting with Cortex rules |
| Investigate Logs | 15 | Pre-filtered logs by service and window, PPL with autocomplete, chart, pattern clustering, baseline diff, pivot to trace |
| Investigate Traces | 25 | Root-span list, trace tree / DAG / Gantt with synced selection, slow and error spans, span links, pivot to logs |
| Investigate Metrics | 20 | Metric catalog with sparklines, label filter, PromQL builder/code, 10+ chart types, RED per service |
| Service & Application Performance | 30 | Application/service catalogs with health KPIs, per-service KPI cards, dependency and operation views, correlation flyouts |
| Service Map & Topology | 10 | Live topology, filter by fault/error rate/environment, service drill-in, card-grid grouping |
| SLOs & Error Budgets | 15 | Browse SLOs, target/SLI/budget, burn-rate gauge, remaining-budget plot, multi-window burn-rate alerts |
| Anomaly Detection & Forecasting | 15 | Browse/create detectors, anomaly grade and feature contribution, forecasters, wire to alerting |
| Dashboards | 30 | Render inline, stack/pin filters, PPL/PromQL panels, variables and transforms, export PDF/PNG/CSV/JSON |
| Datasets, Datasources & Correlations | 20 | Browse/create logs and traces datasets, schema mapping, trace↔logs linking, Prometheus datasources, guided cross-signal |
| Workspaces, Index Patterns & Saved Objects | 15 | List workspaces and datasources, index patterns with signal-type badges, field mappings, query diagnostics |
| Ingestion, APM Setup & Sizing | 15 | APM setup wizard, OTel Collector and Data Prepper YAML, ingestion verification, trace-storage sizing |
| AI / Agent Observability | 45 | Agent traces and gen_ai spans, span-category badges, conversation tracking, tool-call inspection, eval, DeepEval/RAGAS/MLflow bridges |
| Stack Health & Quickstart | 15 | One-glance health, per-component status, doc-landing checks, system doctor, first-trace/first-dashboard tutorials |
| Generative Panels | 5 | One-off custom panel, sandbox-render generated React, validate against PPL and OUI references |
| Total | 295 | 15 families, ~51 surfaces, ~285 UI components |
Job counts are the size of the addressable backlog per family, not an engineering estimate. Components shared across surfaces are the reuse backbone and should live in one shared library.
Correlation
traceId and spanId are a universal pivot defined at the server. The alerts App emits a trace ID, the traces App receives one, the logs App filters on one, the dashboard variable accepts one. None of the Apps know about each other. Cross-signal join uses PPL with a two-query fallback; the server also supports exemplar metric→trace, service-map node→service detail, and span→related logs.
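As a rough illustration of the pivot, the sketch below builds a PPL filter on a trace ID and performs a client-side join of spans to logs, which is one plausible reading of the two-query fallback. The index name `otel_logs`, the helper names, and the row shapes are hypothetical; only the `traceId` and `spanId` field names come from this proposal.

```python
import re
from typing import Dict, List

# OpenTelemetry trace IDs are 16 bytes, rendered as 32 lowercase hex characters.
_TRACE_ID = re.compile(r"[0-9a-f]{32}")


def ppl_for_trace(index: str, trace_id: str) -> str:
    """Build a PPL query that filters one index on a trace ID."""
    if not _TRACE_ID.fullmatch(trace_id):
        raise ValueError(f"not a valid trace ID: {trace_id!r}")
    return f"source = {index} | where traceId = '{trace_id}'"


def join_spans_to_logs(spans: List[Dict], logs: List[Dict]) -> List[Dict]:
    """Client-side join for when a single cross-signal query is unavailable
    (assumed fallback behavior): run one query per signal, then attach each
    log row to the span that emitted it."""
    by_span: Dict[str, List[Dict]] = {}
    for log in logs:
        by_span.setdefault(log.get("spanId"), []).append(log)
    return [{**span, "logs": by_span.get(span["spanId"], [])} for span in spans]


trace_id = "0af7651916cd43dd8448eb211c80319c"
logs_query = ppl_for_trace("otel_logs", trace_id)
joined = join_spans_to_logs(
    [{"spanId": "s1", "name": "PlaceOrder"}],
    [{"spanId": "s1", "body": "timeout"}, {"spanId": "s2", "body": "ok"}],
)
```

Validating the ID before interpolating it keeps the query builder safe to call with values handed over from another App, which matters when no App knows where the ID originated.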
Multi-cloud and auth
Datasources, workspaces, and authentication modes (none, basic, bearer, api-key, AWS SigV4) are first-class server concepts. A single project can observe self-managed OpenSearch, Amazon OpenSearch Service, and Amazon Managed Prometheus at once, with each query routed to the correct backend. The agent does not need to know which cluster anything lives in.
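One way to picture per-signal routing is a small registry that binds each signal type to a datasource with its own auth mode. Everything here is a sketch under assumptions: the `Datasource` and `Router` classes, the `sigv4` spelling of the AWS SigV4 mode, and the example endpoints are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Auth modes listed in this proposal; "sigv4" is an assumed identifier for AWS SigV4.
AUTH_MODES = {"none", "basic", "bearer", "api-key", "sigv4"}


@dataclass(frozen=True)
class Datasource:
    name: str
    kind: str       # "opensearch" or "prometheus"
    endpoint: str
    auth_mode: str

    def __post_init__(self) -> None:
        if self.auth_mode not in AUTH_MODES:
            raise ValueError(f"unsupported auth mode: {self.auth_mode}")


class Router:
    """Maps a signal type to the backend that holds it."""

    def __init__(self) -> None:
        self._by_signal: Dict[str, Datasource] = {}

    def bind(self, signal: str, datasource: Datasource) -> None:
        self._by_signal[signal] = datasource

    def route(self, signal: str) -> Datasource:
        try:
            return self._by_signal[signal]
        except KeyError:
            raise LookupError(f"no datasource bound for signal '{signal}'") from None


router = Router()
router.bind("logs", Datasource("self-managed", "opensearch",
                               "https://os.internal.example", "basic"))
router.bind("traces", Datasource("aos", "opensearch",
                                 "https://aos.example", "sigv4"))
router.bind("metrics", Datasource("amp", "prometheus",
                                  "https://amp.example", "sigv4"))
```

A tool handler would call `router.route("metrics")` and never mention a cluster by name, which is what lets the agent stay unaware of where data lives.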
Agent observability (first-class)
The agent views cover six span categories (Agent, LLM, Tool, Content, Embeddings, Retrieval), color-coded and aggregated; hierarchical trace, DAG, and Gantt views; conversation grouping by gen_ai.conversation.id; tool-call inspection with raw arguments and results; and golden-path evaluation as the agent equivalent of a unit test. This integrates with the existing opensearch-genai-observability-sdk-py SDK and Agent Health (@opensearch-project/agent-health), and tracks toward closing #42 (missing index-pattern fields for GenAI attributes) and #113 (agent-developer onboarding).
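Conversation grouping and per-category aggregation could be sketched as below. The span dictionaries and the pre-assigned `category` field are assumptions for illustration; how a span is classified into one of the six categories is not specified here.

```python
from collections import Counter, defaultdict
from typing import Dict, List

SPAN_CATEGORIES = ("Agent", "LLM", "Tool", "Content", "Embeddings", "Retrieval")


def group_by_conversation(spans: List[Dict]) -> Dict[str, List[Dict]]:
    """Group spans by the gen_ai.conversation.id attribute."""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for span in spans:
        conv = span.get("attributes", {}).get("gen_ai.conversation.id", "unknown")
        groups[conv].append(span)
    return dict(groups)


def category_counts(spans: List[Dict]) -> Dict[str, int]:
    """Count spans per category, keeping zero rows so badges render consistently."""
    counts = Counter(s["category"] for s in spans)
    return {c: counts.get(c, 0) for c in SPAN_CATEGORIES}


# Hypothetical sample spans.
spans = [
    {"category": "Agent", "attributes": {"gen_ai.conversation.id": "conv-1"}},
    {"category": "LLM", "attributes": {"gen_ai.conversation.id": "conv-1"}},
    {"category": "Tool", "attributes": {"gen_ai.conversation.id": "conv-2"}},
    {"category": "LLM", "attributes": {}},
]
conversations = group_by_conversation(spans)
counts = category_counts(spans)
```

Spans without a conversation ID land in an `unknown` bucket in this sketch rather than being dropped, so gaps in instrumentation stay visible.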
Packaging and distribution
| Channel | Mechanism | Notes |
| --- | --- | --- |
| .mcpb bundle | One-click install in Claude Desktop / Claude.ai | One per domain, from GitHub Releases |
| SKILL.md zips | Teach the host agent when to invoke each App | One per domain, co-released with the .mcpb |
| OpenSearch MCP server | Register Apps as callable tools | No per-user install; depends on the server tool contract |
| VS Code / Cursor | MCP App spec, same .mcpb | Zero extra build once bundles ship |
| Kiro extension | Open VSX plugin with WebviewPanel renderer | Kiro does not render MCP App UIs natively; needs a dedicated extension |
Compatibility
Every job in the catalog must work across MCP-spec-compliant hosts with no host-specific assumptions in the contract. CI produces signed .mcpb artifacts and screenshots on every release tag.
Validation
Two independent evaluations already exist for the agentic root-cause workflow over the MCP tools. The App layer adds the human-verifiable surface on top of the same tools.
- Reproducible RCA benchmark (OTel Demo with injected faults, MCP tools only, no shell, independent LLM judge, 3 replicates per case): 87.0% mean accuracy, 85.2% pass rate (23/27 runs), by difficulty 91.7% easy / 100% medium / 66.7% hard.
- Live scenario suite (v3) against the public OpenSearch playground, driven headless: 19 of 19 incident scenarios root-caused, from a single feature-flag failure to a 600-second gRPC DEADLINE_EXCEEDED on a shared dependency to a Postgres 23505 unique-constraint violation.
- The scenario suite is also where the aggregate-before-drill-in rule above came from: adding a failure-mode breakdown to the trace finder flipped three previously failing or partial incidents (checkout-blocked-on-PlaceOrder, travel-planner sub-agent failures, events-agent external dependency) to correct.
Alternatives considered
- One server per domain (the common early pattern). Rejected: every cross-domain pivot crosses a server boundary and loses the workspace, time range, trace ID, and auth token. Cohesion ends up reimplemented as glue in each App, or not at all.
- Text-only MCP tools (status quo). Rejected: lossy summaries force humans out of the conversation to verify, and the model loses the deterministic signal it needs to act.
- Port the console into a single webview. Rejected: the console is page-shaped, not job-shaped, and does not compose into a conversation. The unit of work in an agent surface is a job, not a page.