[PROPOSAL] RFC: One MCP server, many interactive Apps for the OpenSearch observability stack



## Summary

Add a suite of **MCP Apps** to observability-stack: MCP tools that return an interactive UI inline in MCP-compatible hosts (Claude Desktop, Claude.ai, VS Code Copilot, Cursor, Kiro) alongside the compact text summary the model reasons over. The Apps are served by a **single MCP server** for the whole observability stack rather than one server per surface, so shared capabilities (authentication, time and data routing, correlation, save-and-share, inspectability) are defined once at the server and inherited by every App.

The surface is derived from a jobs-to-be-done decomposition of the existing console: **295 jobs across 15 families, served by ~51 composable Apps over ~285 shared UI components**. Each job is independently shippable, so this lands incrementally rather than as a single release.

## Motivation

Today the observability-stack MCP server returns text and JSON. An agent calls a tool, summarizes the result for the human, and the human either trusts the summary or leaves the conversation to verify it in OpenSearch Dashboards. The summary is lossy by definition: a thousand-span trace becomes a sentence about how long the agent ran, and a latency histogram becomes "p99 is high." Two things break at once: the human loses any way to verify the claim in place, and in an autonomous loop the model loses the deterministic signal it needs to choose its next step.

We need a surface that:

- Lets a single agent move from a firing alert, to the trace that explains it, to the logs that span emitted, to the dashboard that tracks the trend, **in one conversation, without losing context** (workspace, time range, `traceId`, auth token).
- Returns a **deterministic UI for the human to verify** next to the agent's reasoning, from the same tool call, without a tab switch or re-authentication.
- Works across **MCP-spec-compliant hosts** with no host-specific assumptions.
- Keeps everything else unchanged. Data stays in OpenSearch; PPL remains the query language for logs and traces, and PromQL for Prometheus; the OTel pipeline (OTel Collector, Data Prepper), SLOs, alerting rules, and dashboards all carry over; and the console remains for users who prefer it.

## Proposal

### Architecture: one MCP server, many Apps

```
MCP host (Claude Desktop / Claude.ai / VS Code Copilot / Cursor / Kiro)
        │
        │  MCP (stdio or HTTP)
        ▼
OpenSearch Observability MCP App server  ──►  data access via opensearch-mcp-server-py
   ├─ model-facing tools  →  return { text summary, UI resource }
   ├─ app-only tools      →  called by the UI through the host bridge for fresh data
   └─ shared server capabilities (inherited by every App):
        auth · time/data routing · correlation · save-and-share · inspectability
        │
        ▼
OpenSearch (logs, traces, service map) · Prometheus / Amazon Managed Prometheus (metrics)
```

- Each **App** is a model-facing MCP tool whose result carries both a compact text summary and a self-contained UI resource the host renders in a sandboxed iframe.
- The UI can call **app-only tools** through the host bridge to fetch fresh data or persist state, without a separate API layer or auth plumbing.
- Capabilities are **defined once at the server**. Every App that joins the catalog inherits them, and every job written against the catalog gets them for free.
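The "defined once, inherited by every App" idea can be sketched as a registry in which the server owns the shared context and injects it into each App handler. This is a minimal illustration, not the proposed implementation: `SharedContext`, `ToolResult`, `AppServer`, the handler signature, and the `ui://` identifiers are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass(frozen=True)
class SharedContext:
    """Hypothetical server-level state that every App inherits."""
    workspace: str
    time_range: str
    auth_mode: str


@dataclass
class ToolResult:
    text: str         # compact summary for the model
    ui_resource: str  # identifier of the UI the host would render


Handler = Callable[[SharedContext, dict], ToolResult]


@dataclass
class AppServer:
    """One server holding the shared context and the App catalog."""
    context: SharedContext
    apps: Dict[str, Handler] = field(default_factory=dict)

    def register(self, name: str) -> Callable[[Handler], Handler]:
        def decorator(handler: Handler) -> Handler:
            self.apps[name] = handler
            return handler
        return decorator

    def call(self, name: str, args: dict) -> ToolResult:
        # The server injects the shared context; Apps never build it themselves.
        return self.apps[name](self.context, args)


server = AppServer(SharedContext(workspace="prod", time_range="now-15m", auth_mode="bearer"))


@server.register("alerts")
def alerts_app(ctx: SharedContext, args: dict) -> ToolResult:
    return ToolResult(
        text=f"alerts in {ctx.workspace} over {ctx.time_range}",
        ui_resource="ui://alerts",
    )


@server.register("traces")
def traces_app(ctx: SharedContext, args: dict) -> ToolResult:
    return ToolResult(
        text=f"trace {args['trace_id']} in {ctx.workspace} over {ctx.time_range}",
        ui_resource="ui://traces",
    )
```

In this sketch neither handler knows the other exists, yet both see the same workspace and time range, because the context lives on the server rather than in any App.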

### Tool result contract (two readers)

A tool result serves two audiences from one return value:

- **Model**: structured text (counts, top offenders, the fields needed to decide the next call). Cheap to reason over, lossless on the decision-driving facts.
- **Human**: the interactive widget (alert table, trace tree, latency chart, burn-rate gauge) rendered inline.

Design rule learned from evaluation (see Validation): **aggregate before you drill in**. On error hunts the agent must see the failure-mode distribution before any single trace, and that distribution must appear in **both** the tool text and a widget card, with guidance to drill into the dominant bucket rather than the slowest trace. Surfacing the aggregate first measurably improved root-cause accuracy.
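As a rough sketch of the aggregate-before-drill-in rule, the function below computes the failure-mode distribution once and emits it in both the model-facing text and the widget payload. The span shape (`status`, `failure_mode`) and the payload keys (`text`, `widget`, `buckets`) are assumptions made for illustration; the RFC does not fix a schema.

```python
from collections import Counter
from typing import Dict, List


def summarize_failures(spans: List[Dict[str, str]]) -> Dict[str, object]:
    """Build the model text and the widget payload from the same aggregate."""
    # Count failure modes over error spans only (hypothetical span fields).
    modes = Counter(s["failure_mode"] for s in spans if s.get("status") == "error")
    ranked = modes.most_common()
    if not ranked:
        return {"text": "No error spans in window.", "widget": {"buckets": []}}

    dominant, count = ranked[0]
    total = sum(modes.values())
    lines = [f"{total} error spans across {len(ranked)} failure modes:"]
    lines += [f"- {mode}: {n}" for mode, n in ranked]
    # Guidance for the agent: start from the dominant bucket, not the slowest trace.
    lines.append(
        f"Drill into '{dominant}' ({count}/{total}) before inspecting the slowest trace."
    )
    buckets = [{"mode": mode, "count": n} for mode, n in ranked]
    return {"text": "\n".join(lines), "widget": {"buckets": buckets}}
```

Because both readers are fed from one `Counter`, the text the model reasons over and the card the human sees cannot disagree about which bucket dominates.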

### Job catalog and scope (JTBD)

| Family | Jobs | Outcome |
|---|---|---|
| Triage & Response | 20 | Scan what is firing, narrow, read the backing monitor, ack/mute/assign, unify OpenSearch Alerting with Cortex rules |
| Investigate Logs | 15 | Pre-filtered logs by service and window, PPL with autocomplete, chart, pattern clustering, baseline diff, pivot to trace |
| Investigate Traces | 25 | Root-span list, trace tree / DAG / Gantt with synced selection, slow and error spans, span links, pivot to logs |
| Investigate Metrics | 20 | Metric catalog with sparklines, label filter, PromQL builder/code, 10+ chart types, RED per service |
| Service & Application Performance | 30 | Application/service catalogs with health KPIs, per-service KPI cards, dependency and operation views, correlation flyouts |
| Service Map & Topology | 10 | Live topology, filter by fault/error rate/environment, service drill-in, card-grid grouping |
| SLOs & Error Budgets | 15 | Browse SLOs, target/SLI/budget, burn-rate gauge, remaining-budget plot, multi-window burn-rate alerts |
| Anomaly Detection & Forecasting | 15 | Browse/create detectors, anomaly grade and feature contribution, forecasters, wire to alerting |
| Dashboards | 30 | Render inline, stack/pin filters, PPL/PromQL panels, variables and transforms, export PDF/PNG/CSV/JSON |
| Datasets, Datasources & Correlations | 20 | Browse/create logs and traces datasets, schema mapping, trace↔logs linking, Prometheus datasources, guided cross-signal |
| Workspaces, Index Patterns & Saved Objects | 15 | List workspaces and datasources, index patterns with signal-type badges, field mappings, query diagnostics |
| Ingestion, APM Setup & Sizing | 15 | APM setup wizard, OTel Collector and Data Prepper YAML, ingestion verification, trace-storage sizing |
| AI / Agent Observability | 45 | Agent traces and `gen_ai` spans, span-category badges, conversation tracking, tool-call inspection, eval, DeepEval/RAGAS/MLflow bridges |
| Stack Health & Quickstart | 15 | One-glance health, per-component status, doc-landing checks, system doctor, first-trace/first-dashboard tutorials |
| Generative Panels | 5 | One-off custom panel, sandbox-render generated React, validate against PPL and OUI references |
| **Total** | **295** | 15 families, ~51 surfaces, ~285 UI components |

Job counts are the size of the addressable backlog per family, not an engineering estimate. Components shared across surfaces are the reuse backbone and should live in one shared library.

### Correlation

`traceId` and `spanId` are a **universal pivot defined at the server**. The alerts App emits a trace ID, the traces App receives one, the logs App filters on one, and the dashboard variable accepts one. No App needs to know about any other. Cross-signal joins use PPL, with a two-query fallback. The server also supports three further pivots: exemplar metric→trace, service-map node→service detail, and span→related logs.
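The two-query fallback might look roughly like the sketch below: resolve the span IDs for a trace first, then filter logs on them. Only the fallback path is shown, not the single-query join. The `run_ppl` callable, the index names `otel_spans` and `otel_logs`, and the assumption that logs carry a `spanId` field are all hypothetical; IDs are checked as lowercase hex of the OpenTelemetry lengths (32 characters for trace IDs, 16 for span IDs) before being interpolated into a query.

```python
import re
from typing import Callable, Dict, List

Row = Dict[str, str]
RunQuery = Callable[[str], List[Row]]  # hypothetical: runs one PPL query, returns rows

HEX_ID = re.compile(r"[0-9a-f]+")


def _checked(identifier: str, length: int) -> str:
    """Accept only lowercase hex IDs of the expected length before interpolating."""
    if len(identifier) != length or not HEX_ID.fullmatch(identifier):
        raise ValueError(f"not a {length}-char hex ID: {identifier!r}")
    return identifier


def logs_for_trace(
    trace_id: str,
    run_ppl: RunQuery,
    spans_index: str = "otel_spans",  # assumed index name
    logs_index: str = "otel_logs",    # assumed index name
) -> List[Row]:
    """Two-query path: fetch span IDs for the trace, then filter logs on them."""
    trace_id = _checked(trace_id, 32)
    spans = run_ppl(
        f"source = {spans_index} | where traceId = '{trace_id}' | fields spanId"
    )
    span_ids = sorted({_checked(row["spanId"], 16) for row in spans})
    if not span_ids:
        return []  # nothing to correlate; skip the second query
    id_list = ", ".join(f"'{sid}'" for sid in span_ids)
    return run_ppl(f"source = {logs_index} | where spanId in ({id_list})")
```

Injecting `run_ppl` keeps the sketch independent of any particular client library, and lets the same function run against a stub in tests.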

### Multi-cloud and auth

Datasources, workspaces, and authentication modes (none, basic, bearer, api-key, AWS SigV4) are first-class server concepts. A single project can observe self-managed OpenSearch, Amazon OpenSearch Service, and Amazon Managed Prometheus at once, with each query routed to the correct backend. The agent does not need to know which cluster anything lives in.
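One way to picture per-query routing is a registry keyed by workspace, where the query language selects the backend kind. This is a sketch under assumptions: the `Datasource` fields, the `sigv4` label for AWS SigV4, the language-to-backend mapping, and the `example.invalid` endpoints are illustrative and not taken from the server.

```python
from dataclasses import dataclass
from typing import Dict, List

# Auth modes named in this RFC; "sigv4" is this sketch's label for AWS SigV4.
AUTH_MODES = {"none", "basic", "bearer", "api-key", "sigv4"}

# Assumed mapping from query language to backend kind.
LANGUAGE_TO_KIND = {"ppl": "opensearch", "promql": "prometheus"}


@dataclass(frozen=True)
class Datasource:
    name: str
    kind: str       # "opensearch" or "prometheus"
    endpoint: str
    auth_mode: str

    def __post_init__(self) -> None:
        if self.auth_mode not in AUTH_MODES:
            raise ValueError(f"unsupported auth mode: {self.auth_mode!r}")


def route(language: str, workspace: str,
          registry: Dict[str, List[Datasource]]) -> Datasource:
    """Pick the first datasource in the workspace whose kind serves the language."""
    kind = LANGUAGE_TO_KIND[language.lower()]
    for datasource in registry.get(workspace, []):
        if datasource.kind == kind:
            return datasource
    raise LookupError(f"no {kind} datasource in workspace {workspace!r}")


registry = {
    "prod": [
        Datasource("self-managed", "opensearch", "https://os.example.invalid", "basic"),
        Datasource("amp", "prometheus", "https://amp.example.invalid", "sigv4"),
    ]
}
```

With this shape, a call such as `route("promql", "prod", registry)` resolves to the `amp` entry, so the caller supplies only the query language and workspace, never the cluster.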

### Agent observability (first-class)

Agent observability covers six color-coded, aggregated span categories (Agent, LLM, Tool, Content, Embeddings, Retrieval); hierarchical trace, DAG, and Gantt views; conversation grouping by `gen_ai.conversation.id`; tool-call inspection with raw arguments and results; and golden-path evaluation as the agent equivalent of a unit test. It integrates with the existing `opensearch-genai-observability-sdk-py` SDK and Agent Health (`@opensearch-project/agent-health`). This work tracks toward closing #42 (missing index-pattern fields for GenAI attributes) and #113 (agent-developer onboarding).
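Conversation grouping with per-category aggregation could be sketched as follows. The span shape is an assumption: each span is taken to carry an already-assigned `category` and an `attributes` map that may include `gen_ai.conversation.id`. How a raw span is classified into one of the six categories is not specified here and is left out of the sketch; the `(no conversation)` bucket is a placeholder label.

```python
from collections import Counter, defaultdict
from typing import Dict, List

CATEGORIES = ("Agent", "LLM", "Tool", "Content", "Embeddings", "Retrieval")
NO_CONVERSATION = "(no conversation)"  # placeholder bucket used by this sketch


def group_by_conversation(spans: List[dict]) -> Dict[str, Dict[str, int]]:
    """Count spans per category within each gen_ai.conversation.id."""
    grouped: Dict[str, Counter] = defaultdict(Counter)
    for span in spans:
        category = span.get("category")
        if category not in CATEGORIES:
            raise ValueError(f"unknown span category: {category!r}")
        conversation = span.get("attributes", {}).get(
            "gen_ai.conversation.id", NO_CONVERSATION
        )
        grouped[conversation][category] += 1
    # Convert Counters to plain dicts so the result is JSON-serializable.
    return {conversation: dict(counts) for conversation, counts in grouped.items()}
```

The per-conversation counts are the kind of aggregate that could feed both span-category badges in the UI and the compact text summary for the model.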

### Packaging and distribution

| Channel | Mechanism | Notes |
|---|---|---|
| `.mcpb` bundle | One-click install in Claude Desktop / Claude.ai | One per domain, from GitHub Releases |
| `SKILL.md` zips | Teach the host agent when to invoke each App | One per domain, co-released with the `.mcpb` |
| OpenSearch MCP server | Register Apps as callable tools | No per-user install; depends on the server tool contract |
| VS Code / Cursor | MCP App spec, same `.mcpb` | Zero extra build once bundles ship |
| Kiro extension | Open VSX plugin with WebviewPanel renderer | Kiro does not render MCP App UIs natively; needs a dedicated extension |

### Compatibility

Every job in the catalog must work across MCP-spec-compliant hosts with no host-specific assumptions in the contract. CI produces signed `.mcpb` artifacts and screenshots on every release tag.

## Validation

Two independent evaluations already exist for the agentic root-cause workflow over the MCP tools. The App layer adds the human-verifiable surface on top of the same tools.

- **Reproducible RCA benchmark** (OTel Demo with injected faults, MCP tools only, no shell, independent LLM judge, 3 replicates per case): **87.0% mean accuracy, 85.2% pass rate (23/27 runs)**, by difficulty 91.7% easy / 100% medium / 66.7% hard.
- **Live scenario suite (v3)** against the public OpenSearch playground, driven headless: **19 of 19 incident scenarios root-caused**, from a single feature-flag failure to a 600-second gRPC `DEADLINE_EXCEEDED` on a shared dependency to a Postgres `23505` unique-constraint violation.
- The scenario suite is also where the **aggregate-before-drill-in** rule above came from: adding a failure-mode breakdown to the trace finder flipped three previously failing or partial incidents (checkout-blocked-on-PlaceOrder, travel-planner sub-agent failures, events-agent external dependency) to correct.

## Alternatives considered

1. **One server per domain** (the common early pattern). Rejected: every cross-domain pivot crosses a server boundary and loses the workspace, time range, trace ID, and auth token. Cohesion ends up reimplemented as glue in each App, or not at all.
2. **Text-only MCP tools (status quo).** Rejected: lossy summaries force humans out of the conversation to verify, and the model loses the deterministic signal it needs to act.
3. **Port the console into a single webview.** Rejected: the console is page-shaped, not job-shaped, and does not compose into a conversation. The unit of work in an agent surface is a job, not a page.

## References

- MCP Apps specification: https://modelcontextprotocol.io/extensions/apps/overview
- OpenTelemetry GenAI semantic conventions: https://opentelemetry.io/docs/specs/semconv/gen-ai/
- Agent Traces RFC (OpenSearch-Dashboards): https://github.com/opensearch-project/OpenSearch-Dashboards/issues/11345
- RFC #27, Versioning Strategy for observability-stack: https://github.com/opensearch-project/observability-stack/issues/27
- Related issues: #42 (GenAI index-pattern fields), #113 (agent-developer onboarding), #112 (APM onboarding)
- AI Observability docs: https://observability.opensearch.org/docs/ai-observability/
- Agent Tracing docs: https://observability.opensearch.org/docs/ai-observability/agent-tracing/
- Python SDK docs: https://observability.opensearch.org/docs/sdks/python/
- MCP Server docs: https://observability.opensearch.org/docs/mcp/

