65 commits
7e04cb6
feat: scaffold helm umbrella chart for observability-stack
kylehounslow Mar 16, 2026
65baa71
fix: working helm values for kind deployment
kylehounslow Mar 16, 2026
f9c6718
fix: add OpenSearch credentials to OSD config
kylehounslow Mar 16, 2026
26e4917
feat: add init job for dashboards, index patterns, and saved objects
kylehounslow Mar 16, 2026
9f1937c
fix: use custom DP 2.15.0-SNAPSHOT image with prometheus sink
kylehounslow Mar 16, 2026
b0d3216
refactor: centralize OpenSearch credentials
kylehounslow Mar 16, 2026
a5b6e27
feat: add canary CronJob generating GenAI agent traces
kylehounslow Mar 16, 2026
1eb99ae
fix: use opensearchstaging 3.6.0 images to match docker-compose
kylehounslow Mar 16, 2026
4ccaa80
fix: add all OSD feature flags for agent tracing and APM
kylehounslow Mar 16, 2026
6fc630c
feat: replace fake canary with real example agents from docker-compose
kylehounslow Mar 17, 2026
ea56211
feat: add Gateway API support for OpenSearch Dashboards
kylehounslow Mar 17, 2026
33b6b60
fix: update NOTES.txt with port-forward commands and credentials
kylehounslow Mar 17, 2026
bbfee2f
fix: use correct OpenSearch service name for datasource creation
kylehounslow Mar 17, 2026
0bd5754
docs: add local TLS development guide with mkcert + Envoy Gateway
kylehounslow Mar 17, 2026
4297232
fix: mount saved queries and architecture image for init job
kylehounslow Mar 17, 2026
32bf22e
feat: add Terraform AWS EKS deployment (VPC, EKS, ALB, IRSA, helm rel…
kylehounslow Mar 18, 2026
b4f74f5
feat: add Prometheus OTLP config, GenAI semconv attrs, node-exporter,…
kylehounslow Mar 18, 2026
7c9e36d
feat: add K8s Cluster Health and Observability Pipeline Health dashbo…
kylehounslow Mar 18, 2026
5a6a9da
feat: expand self-monitoring dashboards (pods by status, memory vs re…
kylehounslow Mar 18, 2026
c12f202
feat: add OpenSearch cluster health dashboard with prometheus-opensea…
kylehounslow Mar 18, 2026
e604eef
feat: add self-monitoring to docker-compose (opensearch-exporter, pip…
kylehounslow Mar 18, 2026
c56383e
fix: prometheus ingestion rate filter + showFullTimeRange for dashboa…
kylehounslow Mar 19, 2026
642d548
sync: copy init script + dashboard YAMLs from docker-compose (source …
kylehounslow Mar 19, 2026
1aef13d
fix: respect BASE_URL env var in init script for helm compatibility
kylehounslow Mar 19, 2026
1b8eda7
sync: update pipeline dashboard YAML from upstream PR #107 (adds Data…
kylehounslow Mar 19, 2026
f2e758e
sync: copy latest AGENTS.md, docker-compose.yml, .gitignore from main
kylehounslow Mar 19, 2026
0736ad7
feat: add Data Prepper Prometheus scrape config to helm chart
kylehounslow Mar 19, 2026
840f5a4
fix: expose Data Prepper metrics port 4900 on service for Prometheus …
kylehounslow Mar 19, 2026
bf71657
fix: force ssl-redirect annotation as string type to avoid helm decod…
kylehounslow Mar 19, 2026
d429a43
docs: add helm chart README and values-eks comments
kylehounslow Mar 19, 2026
c25a193
sync: update .gitignore and AGENTS.md from main
kylehounslow Mar 19, 2026
08d337a
chore: gitignore helm dependency artifacts and terraform plan files
kylehounslow Mar 19, 2026
364c5fa
ci: add GHCR image publish workflow (fork-only, skips on upstream)
kylehounslow Mar 19, 2026
34d20aa
feat: enable example agents on EKS with GHCR images
kylehounslow Mar 19, 2026
18b7e56
feat: add OpenTelemetry Demo as optional subchart (disabled by default)
kylehounslow Mar 19, 2026
628e651
docs: add OTel Demo and GHCR images to READMEs
kylehounslow Mar 19, 2026
7daac67
feat: add enterprise sizing knobs and sizing guide
kylehounslow Mar 19, 2026
1686469
test: add helm-unittest suite for chart templates
kylehounslow Mar 19, 2026
ee6cc63
ci: add helm lint + unittest workflow
kylehounslow Mar 19, 2026
b2530cd
feat: add load testing scripts (OSB, telemetrygen, k6 API + browser)
kylehounslow Mar 20, 2026
b0ac3f6
feat: add weather/events agent URLs to canary for trace variety
kylehounslow Mar 20, 2026
43e2e98
feat: add anonymous authentication support to Helm chart
kylehounslow Mar 20, 2026
a59695d
feat: load test results — 300 VUs clean, 1500 VUs shows p95=2.3s degr…
kylehounslow Mar 20, 2026
97c67d7
feat: add self-monitoring saved queries + capacity planning dashboard…
kylehounslow Mar 20, 2026
a51b97a
fix: wrap high-cardinality PromQL in sum() to fix OSD Prometheus dese…
kylehounslow Mar 20, 2026
2c70dd7
fix: use delete-then-recreate for PromQL dashboards (sync with docker…
kylehounslow Mar 20, 2026
bcf3b37
fix: remove strict-mapping-rejected fields from explore objects (OSD …
kylehounslow Mar 20, 2026
84d2afe
fix: restore explore object fields + use sum() without by for thread …
kylehounslow Mar 20, 2026
a8ca9bb
fix: use correct metric names for active searches and search latency …
kylehounslow Mar 20, 2026
60c6ff5
feat: add query cache size panel to OpenSearch health dashboard
kylehounslow Mar 20, 2026
531ea36
feat: add EC2 load generator terraform + ALB-routed k6 scripts
kylehounslow Mar 20, 2026
1d33373
results: Test 004 — OSD is the bottleneck at 100m CPU, 3s+ latency un…
kylehounslow Mar 20, 2026
b17c00b
perf: scale OSD to 3 replicas, 2 CPU / 2Gi memory (was 100m/512M)
kylehounslow Mar 20, 2026
7013ea3
results: Test 005 — OSD bottleneck resolved, OpenSearch single node a…
kylehounslow Mar 20, 2026
307e3eb
docs: add OpenSearch scaling strategy from official docs + OSD scalin…
kylehounslow Mar 20, 2026
a144c91
perf: scale OpenSearch to 3 nodes (4Gi RAM, 2Gi JVM heap, 2 CPU each)
kylehounslow Mar 20, 2026
5c51b46
results: Test 006 — 3 OS nodes: 37% more throughput, p95 10.5s (was 1…
kylehounslow Mar 20, 2026
16abdca
docs: add capacity sizing chart with VU estimates, data projections, …
kylehounslow Mar 20, 2026
6c3c38f
docs: add AGENTS.md with full load testing procedures for reproducibi…
kylehounslow Mar 20, 2026
d33c863
results: Test 007 — balanced shards: p95 6.32s (was 10.57s), 168 req/…
kylehounslow Mar 20, 2026
00169ec
Merge pull request #8 from kylehounslow/feat/helm-anon-auth
kylehounslow Mar 20, 2026
8d00637
fix: bake all runtime config into values.yaml (never use --set for pe…
kylehounslow Mar 21, 2026
7fb5e73
feat: production EKS sizing (3-node OS, persistent Prometheus, otel-d…
kylehounslow Mar 21, 2026
86baf96
Revert "Merge pull request #8 from kylehounslow/feat/helm-anon-auth"
kylehounslow Mar 21, 2026
6be852a
fix: revert anon auth PR, add missing admin password, disable example…
kylehounslow Mar 21, 2026
30 changes: 30 additions & 0 deletions .github/workflows/helm-test.yml
@@ -0,0 +1,30 @@
name: Helm Tests

on:
  push:
    branches: [main]
    paths:
      - 'charts/**'
      - 'test/helm-test.sh'
  pull_request:
    branches: [main]
    paths:
      - 'charts/**'
      - 'test/helm-test.sh'
  workflow_dispatch:

jobs:
  helm-test:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4

      - name: Set up Helm
        uses: azure/setup-helm@v4

      - name: Install helm-unittest plugin
        run: helm plugin install https://github.com/helm-unittest/helm-unittest.git

      - name: Run helm lint + unittest
        run: ./test/helm-test.sh
44 changes: 44 additions & 0 deletions .github/workflows/publish-images.yml
@@ -0,0 +1,44 @@
name: Build and Push Example Images

on:
  push:
    branches: [main]
    paths:
      - 'examples/**'
      - 'docker-compose/canary/**'
      - '.github/workflows/publish-images.yml'
  workflow_dispatch:

# Only runs on the fork — skips silently on opensearch-project/observability-stack
jobs:
  build:
    if: github.repository == 'kylehounslow/observability-stack'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    strategy:
      matrix:
        include:
          - name: weather-agent
            context: examples/plain-agents/weather-agent
          - name: travel-planner
            context: examples/plain-agents/multi-agent-planner/orchestrator
          - name: events-agent
            context: examples/plain-agents/multi-agent-planner/events-agent
          - name: mcp-server
            context: examples/plain-agents/multi-agent-planner/mcp-server
          - name: canary
            context: docker-compose/canary
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: ${{ matrix.context }}
          push: true
          tags: ghcr.io/${{ github.repository }}/${{ matrix.name }}:latest
17 changes: 17 additions & 0 deletions .gitignore
@@ -9,3 +9,20 @@ node_modules
dist
build
.DS_Store

# Agent worktrees
.worktrees/

# Terraform
*.tfstate
*.tfstate.backup
.terraform/
.terraform.lock.hcl
.terraform.tfstate.lock.info

# Helm dependency artifacts (regenerated by `helm dependency build`)
charts/*/Chart.lock
charts/*/charts/*.tgz

# Terraform plan files
*.plan
32 changes: 32 additions & 0 deletions AGENTS.md
@@ -905,6 +905,38 @@ When creating examples or documentation, always reference the OpenTelemetry Gen-
6. **Commit**: Use descriptive commit messages
7. **Submit PR**: Follow CONTRIBUTING.md guidelines


## Multi-Agent Development with Worktrees

When multiple agents or sessions work on this repo simultaneously, each feature branch gets its own worktree for isolation.

### Structure

```
observability-stack/
├── .worktrees/                 # gitignored, one per feature branch
│   ├── feat-self-monitoring/
│   ├── feat-helm-charts/
│   └── fix-docs/
└── ...                         # main branch
```

### Usage

```bash
# Create
mkdir -p .worktrees
git worktree add .worktrees/<branch-name> <branch-name>

# REQUIRED: Clean up after PR merge
git worktree remove .worktrees/<branch-name>
git branch -d <branch-name>
```

**You MUST remove worktrees after their PR is merged.** Stale worktrees waste disk space and cause confusion about what work is active.

**Terraform limitation:** Terraform state is local and lives in the main repo's `terraform/aws/` directory. It is NOT shared across worktrees. Only run `terraform plan/apply` from the main repo, never from a worktree.

## Common Pitfalls to Avoid

- ❌ Using `latest` image tags (use specific versions)
5 changes: 5 additions & 0 deletions charts/observability-stack/.helmignore
@@ -0,0 +1,5 @@
.DS_Store
.git
*.swp
*.bak
*.tmp
52 changes: 52 additions & 0 deletions charts/observability-stack/Chart.yaml
@@ -0,0 +1,52 @@
apiVersion: v2
name: observability-stack
description: OpenTelemetry-native observability platform for microservices, web apps, and AI agents
type: application
version: 0.1.0
appVersion: "3.6.0"

home: https://github.com/opensearch-project/observability-stack
sources:
  - https://github.com/opensearch-project/observability-stack

maintainers:
  - name: kylehounslow
    url: https://github.com/kylehounslow

keywords:
  - opensearch
  - observability
  - opentelemetry
  - data-prepper
  - agent-tracing

dependencies:
  - name: opensearch
    version: "3.5.0"
    repository: "https://opensearch-project.github.io/helm-charts/"
    condition: opensearch.enabled

  - name: opensearch-dashboards
    version: "3.5.0"
    repository: "https://opensearch-project.github.io/helm-charts/"
    condition: opensearch-dashboards.enabled

  - name: data-prepper
    version: "0.3.1"
    repository: "https://opensearch-project.github.io/helm-charts/"
    condition: data-prepper.enabled

  - name: opentelemetry-collector
    version: "0.147.0"
    repository: "https://open-telemetry.github.io/opentelemetry-helm-charts"
    condition: opentelemetry-collector.enabled

  - name: prometheus
    version: "28.13.0"
    repository: "https://prometheus-community.github.io/helm-charts"
    condition: prometheus.enabled

  - name: opentelemetry-demo
    version: "0.40.5"
    repository: "https://open-telemetry.github.io/opentelemetry-helm-charts"
    condition: opentelemetry-demo.enabled
199 changes: 199 additions & 0 deletions charts/observability-stack/README.md
@@ -0,0 +1,199 @@
# Observability Stack Helm Chart

Umbrella Helm chart that deploys the full observability stack to Kubernetes. Wraps community subcharts (OpenSearch, Prometheus, OTel Collector, Data Prepper) with opinionated defaults and adds self-monitoring dashboards.

## Components

| Subchart | Source | Purpose |
|----------|--------|---------|
| `opensearch` | opensearch-project/helm-charts | Log and trace storage |
| `opensearch-dashboards` | opensearch-project/helm-charts | Web UI |
| `data-prepper` | opensearch-project/helm-charts | OTLP → OpenSearch pipeline |
| `opentelemetry-collector` | open-telemetry/helm-charts | Telemetry receiver and router |
| `prometheus` | prometheus-community/helm-charts | Metrics storage (OTLP + scrape) |

Additional templates (not subcharts):
- `opensearch-exporter` — Prometheus exporter for OpenSearch cluster metrics
- `init-dashboards-job` — Post-install hook that creates index patterns, dashboards, saved queries
- `opensearch-credentials-secret` — Shared credentials secret

## Install

```bash
cd charts/observability-stack
helm dependency build
helm install obs-stack . -n observability-stack --create-namespace
```

For EKS with ALB ingress, use the values override:
```bash
helm install obs-stack . -n observability-stack --create-namespace -f ../../terraform/aws/values-eks.yaml
```

Or use Terraform (recommended) — see `terraform/aws/README.md`.

## Upgrading

The init job (dashboard/index pattern setup) runs as a post-install/post-upgrade hook. It installs pip packages and takes 3-5 minutes, which can exceed Helm's default timeout of 5 minutes (`--timeout 5m0s`).

**Recommended upgrade workflow:**
```bash
# 1. Deploy chart changes (skip hooks to avoid timeout)
helm upgrade obs-stack . -n observability-stack -f ../../terraform/aws/values-eks.yaml --no-hooks

# 2. If dashboard or init script changed, trigger the job manually:
kubectl delete job obs-stack-observability-stack-init-dashboards -n observability-stack 2>/dev/null
helm get hooks obs-stack -n observability-stack | kubectl apply -n observability-stack -f -
kubectl wait --for=condition=complete job/obs-stack-observability-stack-init-dashboards -n observability-stack --timeout=10m
kubectl logs -n observability-stack job/obs-stack-observability-stack-init-dashboards --tail=30
```

If only the scrape configs in `values.yaml` changed (no dashboard changes), step 2 is not needed, but you may need to restart Prometheus so that it picks up the new ConfigMap:
```bash
kubectl rollout restart deployment obs-stack-prometheus-server -n observability-stack
```

## Self-Monitoring Dashboards

Three dashboards are auto-created by the init job from YAML config files in `files/`:

| Dashboard | Panels | File |
|-----------|--------|------|
| Kubernetes Cluster Health | 8 | `files/dashboard-k8s-cluster-health.yaml` |
| Observability Pipeline Health | 24 | `files/dashboard-pipeline-health.yaml` |
| OpenSearch Cluster Health | 10 | `files/dashboard-opensearch-health.yaml` |

**Adding a new dashboard:**
1. Create `files/dashboard-my-thing.yaml` (see existing files for format)
2. Add it to `templates/init-dashboards-configmap.yaml`
3. Add one line to `main()` in `files/init-opensearch-dashboards.py`:
```python
create_promql_dashboard_from_yaml(workspace_id, "/config/dashboard-my-thing.yaml")
```

**Dashboard YAML format:**
```yaml
dashboard:
  id: my-dashboard-id
  title: My Dashboard
  description: What this monitors

panels:
  - id: panel-unique-id
    title: "Panel Title"
    query: "rate(some_metric_total[5m])"
    chartType: line
```
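
Before adding a new dashboard file to the ConfigMap, it can help to sanity-check its structure. The following is a minimal sketch, not part of the init script: `validate_dashboard` is a hypothetical helper that takes an already-parsed config (a plain `dict`), and the set of required keys is an assumption inferred from the example above, with `panels` assumed to be a top-level key.

```python
def validate_dashboard(config: dict) -> list[str]:
    """Return a list of problems found in a parsed dashboard config.

    Assumption: 'dashboard' and 'panels' are top-level keys, and every
    panel needs id, title, query and chartType (inferred from the example).
    """
    problems = []
    dashboard = config.get("dashboard") or {}
    for key in ("id", "title"):
        if not dashboard.get(key):
            problems.append(f"dashboard.{key} is missing")

    seen_ids = set()
    for i, panel in enumerate(config.get("panels") or []):
        for key in ("id", "title", "query", "chartType"):
            if not panel.get(key):
                problems.append(f"panels[{i}].{key} is missing")
        panel_id = panel.get("id")
        if panel_id is not None:
            if panel_id in seen_ids:
                problems.append(f"panels[{i}].id {panel_id!r} is duplicated")
            seen_ids.add(panel_id)
    return problems


example = {
    "dashboard": {"id": "my-dashboard-id", "title": "My Dashboard"},
    "panels": [
        {
            "id": "panel-unique-id",
            "title": "Panel Title",
            "query": "rate(some_metric_total[5m])",
            "chartType": "line",
        }
    ],
}
print(validate_dashboard(example))
```

An empty list means no problems were found under these assumed rules; the real init script may accept or require other fields.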

**Syncing with docker-compose:** The docker-compose init script and dashboard YAMLs (`docker-compose/opensearch-dashboards/`) are the source of truth, and the Helm versions in `files/` should be kept in sync with them. The only Helm-specific additions are the K8s Cluster Health dashboard (not applicable to docker-compose) and the `BASE_URL` env var override in the init script (line 11).
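
For readers unfamiliar with the pattern, an environment-variable override of this kind usually looks like the sketch below. This is an illustration only: the function name `resolve_base_url` and the fallback URL are assumptions, and the actual line in `init-opensearch-dashboards.py` may differ.

```python
import os

# Hypothetical fallback; the real default used by the init script is not shown here.
DEFAULT_BASE_URL = "http://localhost:5601"


def resolve_base_url(env=None) -> str:
    """Return BASE_URL from the environment, or the fallback if it is unset."""
    env = os.environ if env is None else env
    # Strip a trailing slash so later path joins do not produce "//".
    return env.get("BASE_URL", DEFAULT_BASE_URL).rstrip("/")


print(resolve_base_url({"BASE_URL": "http://obs-stack-opensearch-dashboards:5601/"}))
```

With this shape, the same script can run unchanged under docker-compose (no `BASE_URL` set) and under Helm (where the Job sets `BASE_URL` to the in-cluster service).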

## Prometheus Scrape Targets

Configured via `scrapeConfigs` in `values.yaml`. Default K8s scrape jobs are disabled (saves ~60k series). Active targets:

| Job | Target | Interval |
|-----|--------|----------|
| `prometheus` | localhost:9090 | 60s |
| `otel-collector` | `<release>`-opentelemetry-collector:8888 | 10s |
| `opensearch` | `<release>`-observability-stack-opensearch-exporter:9114 | 30s |
| `data-prepper` | `<release>`-data-prepper:4900 | 30s |
| `node-exporter` | auto-discovered via kubernetes_sd | 60s |
| `kube-state-metrics` | auto-discovered via kubernetes_sd | 60s |

> **Note:** Targets use the helm release name as prefix. The values in `values.yaml` are hardcoded to `obs-stack-*` — update them if you change the release name.
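
If you install under a different release name, the static targets can be derived from the naming pattern in the table above. The helper below is a small sketch for generating the values to paste into `values.yaml`; `scrape_targets` is a hypothetical name, and only the three statically addressed jobs are covered.

```python
def scrape_targets(release: str) -> dict[str, str]:
    """Map scrape job name to host:port for a given Helm release name.

    The suffixes and ports are taken from the table above.
    """
    return {
        "otel-collector": f"{release}-opentelemetry-collector:8888",
        "opensearch": f"{release}-observability-stack-opensearch-exporter:9114",
        "data-prepper": f"{release}-data-prepper:4900",
    }


for job, target in scrape_targets("my-release").items():
    print(f"{job}: {target}")
```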

## Sizing Guide

The default values are tuned for development/demo (single-node OpenSearch, minimal resources). For production or enterprise-scale deployments, adjust the following knobs.

### OpenSearch Cluster

| Knob | Default | Production Guidance |
|------|---------|---------------------|
| `opensearch.replicas` | `1` | 3+ data nodes minimum for HA |
| `opensearch.singleNode` | `true` | Set `false` for multi-node |
| `opensearch.resources.requests.memory` | `2Gi` | 8–64Gi per node (JVM gets 50%) |
| `opensearch.persistence.size` | `8Gi` | Size per formula below |
| `opensearch.extraEnvs[OPENSEARCH_JAVA_OPTS]` | `-Xms1g -Xmx1g` | 50% of node RAM, max 31g |
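
The heap guidance in the last row (50% of node RAM, capped at 31g) can be turned into a value for `OPENSEARCH_JAVA_OPTS` with a few lines of arithmetic. This is a rough sketch; `java_opts_for` is a hypothetical helper, and it assumes the node memory is given in GiB.

```python
def java_opts_for(node_ram_gib: float, max_heap_gib: int = 31) -> str:
    """Build -Xms/-Xmx flags: half of node RAM, capped at max_heap_gib."""
    heap_mib = min(int(node_ram_gib * 1024) // 2, max_heap_gib * 1024)
    # Use whole gigabytes when possible, otherwise fall back to megabytes.
    if heap_mib % 1024 == 0:
        size = f"{heap_mib // 1024}g"
    else:
        size = f"{heap_mib}m"
    return f"-Xms{size} -Xmx{size}"


print(java_opts_for(2))    # chart default of 2Gi node memory
print(java_opts_for(128))  # large node, heap is capped
```

Setting `-Xms` and `-Xmx` to the same value matches the chart default and avoids heap resizing at runtime.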

**Storage formula:**
```
storage_per_node = (daily_ingest_GB × 1.45 × (replicas + 1) × retention_days) / node_count
```
The 1.45x multiplier follows the AWS sizing guidance: 10% indexing overhead, 5% of disk reserved by the filesystem, and 20% reserved for segment merges and other internal operations, which combine as 1.1 / 0.95 / 0.8 ≈ 1.45.
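
As a worked example, the formula can be evaluated directly. The sketch below is a plain transcription of it; the function name and the sample inputs (10 GB/day, 1 replica, 30 days of retention, 3 data nodes) are illustrative assumptions, not measured values.

```python
def storage_per_node_gb(
    daily_ingest_gb: float,
    replicas: int,
    retention_days: int,
    node_count: int,
    overhead: float = 1.45,
) -> float:
    """Disk needed per data node, in GB, using the storage formula above."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    total_gb = daily_ingest_gb * overhead * (replicas + 1) * retention_days
    return total_gb / node_count


# Illustrative inputs only: 10 GB/day, 1 replica, 30 days, 3 nodes.
needed = storage_per_node_gb(10, replicas=1, retention_days=30, node_count=3)
print(f"{needed:.0f} GB per node")
```

With these inputs the result is about 290 GB per node; round up to the next practical volume size when setting `opensearch.persistence.size`.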

**Shard sizing:**
- Logs/traces (write-heavy): 30–50 GB per primary shard
- Search (latency-sensitive): 10–30 GB per primary shard
- Total shards should be a multiple of data node count
- Max 25 shards per GB of JVM heap

Shard count is configurable per Data Prepper pipeline sink via `number_of_shards` and `number_of_replicas` (commented out in `values.yaml`).
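
The shard sizing rules above can be combined into a quick estimate for `number_of_shards`. This is a sketch under stated assumptions: both function names are hypothetical, the target shard size is whatever you pick from the ranges above, and the 25-shards-per-GB limit is applied per node.

```python
import math


def primary_shard_count(index_size_gb: float, target_shard_gb: float, data_nodes: int) -> int:
    """Primary shards for an index, rounded up to a multiple of data_nodes."""
    raw = max(1, math.ceil(index_size_gb / target_shard_gb))
    return math.ceil(raw / data_nodes) * data_nodes


def max_shards_for_heap(heap_gb: int, shards_per_gb: int = 25) -> int:
    """Upper bound on shards a node should hold for a given JVM heap size."""
    return heap_gb * shards_per_gb


# Illustrative inputs: a 200 GB index, 40 GB target shard size, 3 data nodes.
print(primary_shard_count(200, 40, 3))
print(max_shards_for_heap(2))
```

Rounding up to a multiple of the node count keeps shards evenly distributed, at the cost of slightly smaller shards than the target.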

### Data Prepper Pipeline Tuning

| Knob | Default | Description |
|------|---------|-------------|
| `data-prepper.pipelineConfig.config.otel-logs-pipeline.workers` | `5` | Parallel log processing threads |
| `...opensearch.number_of_shards` | (OS default: 1) | Primary shards per index |
| `...opensearch.number_of_replicas` | (OS default: 1) | Replica shards per primary |
| `...opensearch.bulk_size` | `5` (MiB) | Bulk request size to OpenSearch |

### Prometheus

| Knob | Default | Description |
|------|---------|-------------|
| `prometheus.server.retention` | `15d` | How long metrics are kept |
| `prometheus.server.persistentVolume.enabled` | `false` | Enable for production |
| `prometheus.server.persistentVolume.size` | `8Gi` | Disk for metrics TSDB |

### Quick Reference: Sizing Profiles

| Profile | OS Nodes | OS Memory | OS Disk | Prometheus Retention |
|---------|----------|-----------|---------|---------------------|
| **Dev/Demo** (default) | 1 | 2Gi | 8Gi | 15d |
| **Small team** (~10 GB/day) | 3 | 8Gi | 100Gi | 30d |
| **Enterprise** (~100 GB/day) | 6+ | 32Gi | 500Gi+ | 90d |

Sources: [OpenSearch shard sizing](https://opensearch.org/blog/optimize-opensearch-index-shard-size/), [AWS sizing guide](https://docs.aws.amazon.com/prescriptive-guidance/latest/opensearch-service-migration/sizing.html), [AWS shard best practices](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-sharding.html)

## Key Values

See `values.yaml` for all options. Notable settings:

```yaml
# Credentials (update opensearchPassword before any real deployment)
opensearchUsername: "admin"
opensearchPassword: "My_password_123!@#"

# Data Prepper metrics port (must be in ports list for Prometheus to scrape)
data-prepper:
  ports:
    - name: metrics
      port: 4900

# Disable noisy K8s scrape defaults
prometheus:
  scrapeConfigs:
    kubernetes-api-servers: { enabled: false }
    # ... etc
```

## OpenTelemetry Demo (Optional)

The [OpenTelemetry Demo](https://opentelemetry.io/docs/demo/) is available as an optional subchart. It deploys a full microservices e-commerce app (20+ services) that generates realistic telemetry — useful for load testing and showcasing the stack.

Disabled by default (~2GB additional memory required).

**Enable:**
```bash
helm upgrade obs-stack . -n observability-stack -f ../../terraform/aws/values-eks.yaml \
--set opentelemetry-demo.enabled=true --no-hooks
```

**Disable:**
```bash
helm upgrade obs-stack . -n observability-stack -f ../../terraform/aws/values-eks.yaml --no-hooks
```

All bundled backends (Jaeger, Grafana, Prometheus, OpenSearch) in the demo chart are disabled — demo services send telemetry to our OTel Collector. No duplicate infrastructure.