Commit a986ab8

Add Claude Code observability plugin with skills, tests, and docs (opensearch-project#120)
1 parent 2ecd1bc commit a986ab8

37 files changed

Lines changed: 8343 additions & 0 deletions

.claude-plugin/marketplace.json

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
{
  "name": "observability-stack",
  "owner": {
    "name": "OpenSearch Project",
    "email": "anirudha@nyu.edu"
  },
  "metadata": {
    "description": "Observability plugins for the OpenSearch stack"
  },
  "plugins": [
    {
      "name": "observability",
      "source": "./claude-code-observability-plugin",
      "description": "Query and investigate traces, logs, and metrics from an OpenSearch-based observability stack using PPL and PromQL",
      "version": "1.0.0",
      "author": {
        "name": "OpenSearch Project"
      }
    }
  ]
}
Lines changed: 47 additions & 0 deletions
@@ -0,0 +1,47 @@
name: Claude Code Plugin Release

on:
  release:
    types: [published]
  workflow_dispatch:

jobs:
  build-plugin-zips:
    name: Build Plugin ZIPs
    runs-on: ubuntu-latest
    permissions:
      contents: write
    steps:
      - uses: actions/checkout@v4

      - name: Build skill ZIP files
        run: |
          PLUGIN_DIR=claude-code-observability-plugin
          DIST_DIR=$PLUGIN_DIR/dist
          mkdir -p "$DIST_DIR"

          for skill_dir in "$PLUGIN_DIR"/skills/*/; do
            skill_name=$(basename "$skill_dir")
            if [ -f "$skill_dir/SKILL.md" ]; then
              zip -j "$DIST_DIR/${skill_name}.zip" "$skill_dir/SKILL.md"
              echo "Built $DIST_DIR/${skill_name}.zip"
            fi
          done

          ls -la "$DIST_DIR"

      - name: Upload ZIPs as artifacts
        uses: actions/upload-artifact@v4
        with:
          name: claude-code-plugin-skills
          path: claude-code-observability-plugin/dist/*.zip

      - name: Attach ZIPs to release
        if: github.event_name == 'release'
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          for zip in claude-code-observability-plugin/dist/*.zip; do
            gh release upload "${{ github.event.release.tag_name }}" "$zip" --clobber
            echo "Uploaded $(basename "$zip") to release ${{ github.event.release.tag_name }}"
          done
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
{
  "name": "opensearch@observability",
  "version": "1.0.0",
  "description": "Query and investigate traces, logs, and metrics from an OpenSearch-based observability stack using PPL and PromQL",
  "author": {
    "name": "OpenSearch Project",
    "url": "https://github.com/opensearch-project/observability-stack"
  },
  "homepage": "https://observability.opensearch.org/docs/claude-code/",
  "repository": "https://github.com/opensearch-project/observability-stack",
  "license": "Apache-2.0",
  "keywords": ["observability", "opensearch", "traces", "logs", "metrics", "ppl", "promql", "opentelemetry"]
}
Lines changed: 103 additions & 0 deletions
@@ -0,0 +1,103 @@
# OpenSearch Observability Plugin for Claude Code

This plugin teaches Claude Code how to query and investigate traces, logs, and metrics from an OpenSearch-based observability stack. It provides nine skill files containing PPL (Piped Processing Language) query templates for OpenSearch, PromQL query templates for Prometheus, and curl-based commands — all ready to execute against a running stack.

## Skill Routing Table

Load the appropriate skill file based on the user's intent:

| Skill | When to Use |
|---|---|
| `skills/traces/SKILL.md` | Use when investigating agent invocations, tool executions, slow spans, error spans, token usage, or trace correlation |
| `skills/logs/SKILL.md` | Use when searching logs by severity, correlating logs with traces, identifying error patterns, or analyzing log volume |
| `skills/metrics/SKILL.md` | Use when querying HTTP request rates, latency percentiles, error rates, active connections, or GenAI metrics |
| `skills/stack-health/SKILL.md` | Use when checking stack component health, troubleshooting data flow issues, or verifying service status |
| `skills/ppl-reference/SKILL.md` | Use when constructing novel PPL queries, looking up PPL syntax, or understanding PPL functions |
| `skills/correlation/SKILL.md` | Use when performing cross-signal correlation between traces, logs, and metrics |
| `skills/apm-red/SKILL.md` | Use when analyzing RED metrics (Rate, Errors, Duration) for service-level monitoring |
| `skills/slo-sli/SKILL.md` | Use when defining SLOs/SLIs, calculating error budgets, or setting up burn rate alerts |
| `skills/osd-config/SKILL.md` | Use when discovering index patterns, workspaces, saved objects, APM configs, or field mappings from OpenSearch Dashboards or OpenSearch APIs |

## Configuration

### Environment Variables

Set these environment variables to override default endpoints:

- `$OPENSEARCH_ENDPOINT` — OpenSearch base URL (default: `https://localhost:9200`)
- `$PROMETHEUS_ENDPOINT` — Prometheus base URL (default: `http://localhost:9090`)
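
A minimal sketch of how a shell script might honor these variables, falling back to the documented defaults when they are unset. The function names `opensearch_url` and `prometheus_url` are hypothetical helpers for illustration and are not part of the plugin.

```bash
# Hypothetical helpers: resolve each endpoint, using the documented
# default when the environment variable is unset or empty.
opensearch_url() {
  printf '%s\n' "${OPENSEARCH_ENDPOINT:-https://localhost:9200}"
}

prometheus_url() {
  printf '%s\n' "${PROMETHEUS_ENDPOINT:-http://localhost:9090}"
}

echo "OpenSearch: $(opensearch_url)"
echo "Prometheus: $(prometheus_url)"
```

Resolving the endpoint in one place keeps the curl commands identical across local and remote stacks.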

### Connection Profiles

#### Local Stack (Default)

| Service | Endpoint | Auth |
|---|---|---|
| OpenSearch | `https://localhost:9200` | `-u admin:'My_password_123!@#' -k` (HTTPS + basic auth, skip TLS verify) |
| Prometheus | `http://localhost:9090` | None (HTTP, no auth) |

Example OpenSearch curl:

```bash
curl -sk -u admin:'My_password_123!@#' \
  -X POST https://localhost:9200/_plugins/_ppl \
  -H 'Content-Type: application/json' \
  -d '{"query": "source=otel-v1-apm-span-* | head 10"}'
```
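
The request body is JSON, so a PPL query that itself contains double quotes or backslashes has to be escaped before it is embedded. A minimal sketch, assuming a hypothetical helper named `ppl_body` (not part of the plugin) that handles only those two characters:

```bash
# Hypothetical helper: wrap a PPL query in the JSON body expected by
# the _plugins/_ppl endpoint.
ppl_body() {
  # Escape backslashes first, then double quotes, so the query
  # becomes a valid JSON string value.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"query": "%s"}\n' "$escaped"
}

ppl_body 'source=otel-v1-apm-span-* | head 10'
```

The result can be passed to curl as `-d "$(ppl_body '...')"` in place of a hand-written JSON string. Control characters such as newlines are not handled by this sketch.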
47+
48+
Example Prometheus curl:
49+
50+
```bash
51+
curl -s 'http://localhost:9090/api/v1/query' \
52+
--data-urlencode 'query=up'
53+
```
54+
55+
#### AWS Managed Services
56+
57+
##### Amazon OpenSearch Service
58+
59+
- Endpoint format: `https://DOMAIN-ID.REGION.es.amazonaws.com`
60+
- Auth: AWS Signature Version 4
61+
62+
```bash
63+
curl -s --aws-sigv4 "aws:amz:REGION:es" \
64+
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
65+
-X POST https://DOMAIN-ID.REGION.es.amazonaws.com/_plugins/_ppl \
66+
-H 'Content-Type: application/json' \
67+
-d '{"query": "source=otel-v1-apm-span-* | head 10"}'
68+
```
69+
70+
##### Amazon Managed Service for Prometheus (AMP)
71+
72+
- Endpoint format: `https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query`
73+
- Auth: AWS Signature Version 4
74+
75+
```bash
76+
curl -s --aws-sigv4 "aws:amz:REGION:aps" \
77+
--user "$AWS_ACCESS_KEY_ID:$AWS_SECRET_ACCESS_KEY" \
78+
'https://aps-workspaces.REGION.amazonaws.com/workspaces/WORKSPACE_ID/api/v1/query' \
79+
--data-urlencode 'query=up'
80+
```
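
Because only the region, workspace, and service code vary in the AWS examples above, the URL and the `--aws-sigv4` value can be derived from those inputs. A minimal sketch; the function names and the sample region and workspace ID are hypothetical placeholders:

```bash
# Build the AMP query URL from a region ($1) and workspace ID ($2).
amp_query_url() {
  printf 'https://aps-workspaces.%s.amazonaws.com/workspaces/%s/api/v1/query\n' "$1" "$2"
}

# Build the value passed to curl's --aws-sigv4 option.
# $1 = region, $2 = service code ("es" for OpenSearch Service, "aps" for AMP).
sigv4_scope() {
  printf 'aws:amz:%s:%s\n' "$1" "$2"
}

amp_query_url us-east-1 ws-EXAMPLE
sigv4_scope us-east-1 aps
```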

> **Note:** PPL and PromQL query syntax is identical across local and AWS managed profiles. Only the endpoint URL and authentication method differ.

## Port Reference

| Component | Port | Protocol |
|---|---|---|
| OpenSearch | 9200 | HTTPS |
| OTel Collector (gRPC) | 4317 | gRPC |
| OTel Collector (HTTP) | 4318 | HTTP |
| Data Prepper | 21890 | HTTP |
| Prometheus | 9090 | HTTP |
| OpenSearch Dashboards | 5601 | HTTP |

## Index Patterns

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Traces | `otel-v1-apm-span-*` | `traceId`, `spanId`, `serviceName`, `name`, `durationInNanos`, `status.code`, `attributes.gen_ai.*` |
| Logs | `logs-otel-v1-*` | `traceId`, `spanId`, `severityText`, `body`, `resource.attributes.service.name`, `@timestamp` |
| Service Maps | `otel-v2-apm-service-map-*` | `sourceNode`, `targetNode`, `sourceOperation`, `targetOperation` |

> **Note:** The log index uses `resource.attributes.service.name` (backtick-quoted in PPL) instead of `serviceName`. The trace span index has a top-level `serviceName` field.
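
Backticks need care when a PPL query is written in a shell script: inside double quotes the shell treats them as command substitution, so the query should be single-quoted. A minimal sketch, assuming a hypothetical helper `logs_by_service_query` and an illustrative service name:

```bash
# Build a PPL query that filters logs by service name.
# $1 = service name, $2 = row limit (defaults to 10).
logs_by_service_query() {
  # Single quotes keep the backticks literal; '\'' emits a literal single quote.
  printf 'source=logs-otel-v1-* | where `resource.attributes.service.name` = '\''%s'\'' | head %s\n' "$1" "${2:-10}"
}

logs_by_service_query weather-agent
```

The same query against the span index would filter on the top-level `serviceName` field and would not need backticks.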
Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
# Installation Guide

## Prerequisites

1. **Claude Code CLI** — Install from [claude.ai/claude-code](https://claude.ai/claude-code)
2. **Running Observability Stack** — The plugin queries a local OpenSearch + Prometheus stack

### Start the Observability Stack

```bash
git clone https://github.com/opensearch-project/observability-stack.git
cd observability-stack
docker compose up -d
```

Verify services are running:

```bash
# OpenSearch (should return cluster health JSON)
# The URL is quoted so that shells such as zsh do not treat "?" as a glob.
curl -sk -u 'admin:My_password_123!@#' 'https://localhost:9200/_cluster/health?pretty'

# Prometheus (should return "Prometheus Server is Healthy.")
curl -s http://localhost:9090/-/healthy
```
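
Containers can take a while to accept connections after `docker compose up -d`, so a single check may fail spuriously. A minimal sketch of a retry loop; `wait_until` is a hypothetical helper, and the attempt counts and delays in the usage comments are arbitrary:

```bash
# Retry a command until it succeeds or the attempts run out.
# $1 = max attempts, $2 = seconds between attempts, rest = command to run.
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage against the local defaults (curl's -f makes HTTP errors fail):
# wait_until 30 2 curl -sf http://localhost:9090/-/healthy
# wait_until 30 2 curl -skf -u 'admin:My_password_123!@#' 'https://localhost:9200/_cluster/health'
```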

## Install the Plugin

From the `observability-stack` repository root:

```bash
claude install-plugin ./claude-code-observability-plugin
```

Or install directly from GitHub:

```bash
claude install-plugin https://github.com/opensearch-project/observability-stack/tree/main/claude-code-observability-plugin
```

## Verify Installation

Start Claude Code and try a query:

```
claude
> Show me the top 10 services by trace span count
```

Claude should execute a PPL query against OpenSearch and return results. You can also try:

```
> Check the health of the observability stack
> Show me error logs from the last hour
> What is the p95 latency for all services?
```

## Configuration

### Default Endpoints

| Service | Endpoint | Auth |
|---|---|---|
| OpenSearch | `https://localhost:9200` | `admin` / `My_password_123!@#` (HTTPS, `-k` flag) |
| Prometheus | `http://localhost:9090` | None |

### Custom Endpoints

Override defaults with environment variables:

```bash
export OPENSEARCH_ENDPOINT=https://my-opensearch:9200
export PROMETHEUS_ENDPOINT=http://my-prometheus:9090
```

### AWS Managed Services

The plugin supports Amazon OpenSearch Service and Amazon Managed Service for Prometheus. Queries use AWS SigV4 authentication instead of basic auth. See the skill files for AWS-specific curl examples.

## Available Skills

| Skill | Description |
|---|---|
| `traces` | Query trace spans — agent invocations, tool executions, latency, errors |
| `logs` | Search and analyze logs — severity filtering, body search, error patterns |
| `metrics` | Query Prometheus metrics — HTTP rates, latency percentiles, GenAI tokens |
| `stack-health` | Check component health, verify data ingestion, troubleshoot issues |
| `ppl-reference` | Comprehensive PPL syntax reference with observability examples |
| `correlation` | Cross-signal correlation between traces, logs, and metrics |
| `apm-red` | RED metrics (Rate, Errors, Duration) for service monitoring |
| `slo-sli` | SLO/SLI definitions, error budgets, and burn rate alerting |
| `osd-config` | Discover index patterns, workspaces, saved objects, APM configs, and field mappings |

## Running Tests

```bash
cd claude-code-observability-plugin/tests
pip install -r requirements.txt

# All tests (requires running stack)
pytest -v

# Property tests only (no stack needed)
pytest test_properties.py -v

# Filter by skill
pytest -m traces
pytest -m logs
pytest -m metrics
```

## Troubleshooting

### "Observability stack is not running"

Tests and skills require OpenSearch and Prometheus to be running locally. Start them with:

```bash
docker compose up -d opensearch prometheus
```

### OpenSearch returns "Unauthorized"

Check the password in `.env` matches what you're using. Default: `My_password_123!@#`

### No trace/log data

The observability stack includes example services (canary, weather-agent, travel-planner) that generate telemetry data automatically. Ensure they're running:

```bash
docker compose ps | grep -E "canary|weather|travel"
```

If not running, check that `INCLUDE_COMPOSE_EXAMPLES=docker-compose.examples.yml` is set in `.env`.
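
That check can be scripted. A minimal sketch, assuming the `.env` file uses plain `KEY=value` lines with no quoting or trailing spaces; `env_file_has` is a hypothetical helper:

```bash
# Succeed only if the file contains the exact line KEY=VALUE.
# $1 = file, $2 = key, $3 = expected value
env_file_has() {
  # -x matches whole lines, -F disables regex, -q suppresses output.
  grep -qxF -- "$2=$3" "$1"
}

# Usage, from the repository root:
# env_file_has .env INCLUDE_COMPOSE_EXAMPLES docker-compose.examples.yml \
#   || echo "example services are not enabled"
```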

### Prometheus OOM / crash-looping

If Prometheus is crash-looping with exit code 137, the container is being killed with SIGKILL, typically by the out-of-memory killer. Its WAL may be corrupted or too large to replay within the memory limit. Clear the volume and restart (this deletes all stored Prometheus data):

```bash
docker compose stop prometheus
docker compose rm -f prometheus
docker volume rm observability-stack_prometheus-data
docker compose up -d prometheus
```
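
Exit codes above 128 conventionally encode the terminating signal as 128 plus the signal number, which is how 137 maps to signal 9 (SIGKILL). A minimal sketch of that arithmetic; `exit_signal` is a hypothetical helper:

```bash
# Print the signal number implied by an exit code, or 0 if the
# code does not indicate termination by a signal.
exit_signal() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo 0
  fi
}

exit_signal 137
```

To confirm an out-of-memory kill, `docker inspect --format '{{.State.OOMKilled}}' CONTAINER` reports whether the container was stopped for exceeding its memory limit.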

## Index Reference

| Signal | Index Pattern | Key Fields |
|---|---|---|
| Traces | `otel-v1-apm-span-*` | `traceId`, `spanId`, `serviceName`, `name`, `durationInNanos`, `status.code` |
| Logs | `logs-otel-v1-*` | `traceId`, `spanId`, `severityText`, `body`, `resource.attributes.service.name` |
| Service Maps | `otel-v2-apm-service-map-*` | `sourceNode`, `targetNode`, `sourceOperation`, `targetOperation` |
