Umbrella Helm chart that deploys the full OpenSearch observability stack to Kubernetes. Wraps community subcharts (OpenSearch, OpenSearch Dashboards, OTel Collector, Data Prepper) plus Cortex + Alertmanager rendered as native templates, with opinionated defaults and self-monitoring dashboards. Enables key observability features like APM, Agent-Tracing, and pre-canned alerting by default.
| Subchart | Source | Purpose |
|---|---|---|
| `opensearch` | opensearch-project/helm-charts | Log and trace storage |
| `opensearch-dashboards` | opensearch-project/helm-charts | Web UI |
| `data-prepper` | opensearch-project/helm-charts | OTLP → OpenSearch pipeline |
| `opentelemetry-collector` | open-telemetry/helm-charts | Telemetry receiver and router |
Native templates (rendered directly by this chart, not subcharts):
- `cortex-*`: Single-binary Cortex (Prometheus-compatible TSDB + ruler), exposed as Service `<release>-prometheus-server` so existing callers keep working
- `alertmanager-*`: Alertmanager rendered as Deployment + Service + PVC + Secret. Indexes every alert into OpenSearch for searchable history
- `cortex-rules-configmap`: Cortex ruler rule groups loaded from `files/rules-stack/` and `files/rules-otel-demo/`
- `alerting-rules-monitors-init-job` / `otel-demo-alerting-rules-monitors-init-job`: Post-install hooks that load rules into Cortex Ruler and OpenSearch alerting monitors
- `opensearch-exporter`: Bridges OpenSearch cluster metrics to Cortex (OpenSearch has no native Prometheus endpoint)
- `init-dashboards-job`: Post-install hook that creates index patterns, dashboards, and saved queries
- `opensearch-credentials-secret`: Shared credentials secret for all components
- `data-prepper-pipeline-secret`: Pipeline config with credentials injected at template time
- `otel-collector-configmap`: Collector config with dynamic service names, Cortex remote-write exporter, and self/envoy scrape
helm install obs charts/observability-stack
For local development (kind) with reduced resources:
helm install obs charts/observability-stack \
--set opensearch.singleNode=true \
--set opensearch.replicas=1 \
--set opensearch.resources.requests.memory=1Gi \
--set opensearch.resources.limits.memory=1Gi \
--set opensearch.opensearchJavaOpts="-Xms512m -Xmx512m" \
--set opensearch.persistence.size=2Gi
All components read credentials from a single `opensearch-credentials` Kubernetes Secret, sourced from `opensearchUsername` and `opensearchPassword` in values.yaml:
| Component | How it reads credentials |
|---|---|
| OpenSearch | secretKeyRef → OPENSEARCH_INITIAL_ADMIN_PASSWORD |
| OpenSearch Dashboards | opensearchAccount.secret (native sub-chart feature) |
| Data Prepper | Pipeline config rendered as Secret template |
| Init Job | secretKeyRef → OPENSEARCH_USER / OPENSEARCH_PASSWORD |
Set a custom password at install time:
helm install obs charts/observability-stack --set opensearchPassword='YourSecurePassword!'
Note: The password is set at first boot. `OPENSEARCH_INITIAL_ADMIN_PASSWORD` only takes effect when OpenSearch initializes its security index. To change the password after install, use the Security REST API or `securityadmin.sh`, then update the Secret.
By default, aws eks update-kubeconfig gives operators full cluster-admin access. The chart provides optional scoped ServiceAccounts so operators can choose between readonly access for safe monitoring and admin access for explicit write operations. This prevents accidental modifications when you only intend to observe.
Note: This controls Kubernetes API access (kubectl, helm). It is unrelated to OpenSearch user authentication — see Credential Management for OpenSearch credentials.
Enable:
rbac:
  enabled: true
This creates two Kubernetes ServiceAccounts with long-lived tokens:
| Account | ClusterRole | Purpose |
|---|---|---|
| `<release>-admin` | `cluster-admin` | Full K8s cluster access: deployments, scaling, deletes |
| `<release>-readonly` | `view` | Read-only: get, list, watch across all namespaces |
Generate a scoped kubeconfig:
NS=observability-stack
RELEASE=obs-stack
# Use $HOME here: the shell does not expand ~ after "=" inside a flag value
KUBECONFIG_OUT="$HOME/.kube/obs-stack-readonly.yaml"
# Extract the token (admin or readonly)
TOKEN=$(kubectl get secret "${RELEASE}-observability-stack-readonly-token" -n "$NS" \
  -o jsonpath='{.data.token}' | base64 -d)
# Get cluster info from current kubeconfig
SERVER=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
CA=$(kubectl config view --minify --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}')
# Write a standalone kubeconfig
kubectl config set-cluster obs --server="$SERVER" \
  --certificate-authority=<(echo "$CA" | base64 -d) --embed-certs \
  --kubeconfig="$KUBECONFIG_OUT"
kubectl config set-credentials readonly --token="$TOKEN" \
  --kubeconfig="$KUBECONFIG_OUT"
kubectl config set-context obs-readonly --cluster=obs --user=readonly \
  --kubeconfig="$KUBECONFIG_OUT"
kubectl config use-context obs-readonly --kubeconfig="$KUBECONFIG_OUT"
Use the scoped kubeconfig:
# Safe monitoring — cannot modify any resources
KUBECONFIG=~/.kube/obs-stack-readonly.yaml kubectl get pods -n observability-stack
# Explicit admin operations — use the admin kubeconfig only when needed
KUBECONFIG=~/.kube/obs-stack-admin.yaml kubectl rollout restart deployment ...
The readonly account is recommended as the default for day-to-day monitoring. Switch to the admin kubeconfig only when you need to make changes.
The OTel Collector config and Data Prepper pipeline config are rendered as parent-chart templates using {{ .Release.Name }}. This means the chart works with any release name — service hostnames are resolved automatically.
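The resource names used by the commands in this README all follow one pattern for the `obs-stack` release. The helper below is a minimal sketch of that pattern; the function name `stack_resource_name` is hypothetical, and it assumes the chart's fullname is `<release>-observability-stack`, which is inferred from the names shown in this README and may not hold if a fullname override is set.

```shell
# Build a resource name for a given release.
# ASSUMPTION: chart fullname is "<release>-observability-stack" (inferred from
# names such as obs-stack-observability-stack-cortex used in this README).
stack_resource_name() {
  # $1 = Helm release name, $2 = component suffix
  printf '%s-observability-stack-%s\n' "$1" "$2"
}

# Names referenced elsewhere in this README, for release "obs-stack":
stack_resource_name obs-stack init-dashboards   # init Job
stack_resource_name obs-stack cortex            # Cortex Deployment
stack_resource_name obs-stack readonly-token    # readonly token Secret
```

Deriving names this way keeps ad-hoc scripts working when the release is installed under a different name.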
The chart's defaults are tuned for development convenience and use insecure intra-cluster transport for the telemetry pipeline:
- OTel Collector → Data Prepper runs over plaintext gRPC (`tls.insecure: true`, `insecure_skip_verify: true`)
- Data Prepper → OpenSearch trusts the cluster's self-signed cert (`insecure: true`)
- OpenSearch Dashboards → OpenSearch uses `verificationMode: none`
For production, you should:
- Front the cluster with a service mesh (e.g. Istio, Linkerd) for mTLS between pods, or
- Issue a real CA-signed certificate for OpenSearch (cert-manager + a Secret) and flip `tls.insecure` / `insecure_skip_verify` to `false` in `templates/otel-collector-configmap.yaml`, plus the matching settings in the Data Prepper pipeline Secret and Dashboards config
- Always use the Gateway API section below to terminate TLS for external Dashboards traffic
The chart includes optional Gateway API resources for exposing OpenSearch Dashboards with TLS. Disabled by default.
Supported providers:
- Envoy Gateway — for local dev or self-managed clusters
- AWS Gateway API Controller — for EKS with VPC Lattice
helm install obs . \
--set gateway.enabled=true \
--set gateway.host=dashboards.example.com \
--set gateway.tls.secretName=dashboards-tls
See docs/local-tls.md for a full local development walkthrough with mkcert and Envoy Gateway. See values.yaml for all gateway options including AWS provider configuration.
The init job (dashboard/index pattern setup) runs as a post-install/post-upgrade hook. It installs pip packages and takes 3-5 minutes, which often exceeds helm's default timeout.
Recommended upgrade workflow:
# 1. Deploy chart changes (skip hooks to avoid timeout)
helm upgrade obs-stack . -n observability-stack --no-hooks
# 2. If dashboard or init script changed, trigger the job manually:
kubectl delete job obs-stack-observability-stack-init-dashboards -n observability-stack 2>/dev/null
helm get hooks obs-stack -n observability-stack | kubectl apply -n observability-stack -f -
kubectl wait --for=condition=complete job/obs-stack-observability-stack-init-dashboards -n observability-stack --timeout=10m
kubectl logs -n observability-stack job/obs-stack-observability-stack-init-dashboards --tail=30
If only values.yaml config changed (no dashboard changes), step 2 is not needed. You may still need to restart Cortex to pick up the new configmap (the `checksum/config` annotation on the Cortex pod usually does this for you on `helm upgrade`):
kubectl rollout restart deployment obs-stack-observability-stack-cortex -n observability-stack
Three dashboards are auto-created by the init job from YAML config files in files/:
| Dashboard | Panels | File |
|---|---|---|
| Kubernetes Cluster Health | 8 | files/dashboard-k8s-cluster-health.yaml |
| Observability Pipeline Health | 24 | files/dashboard-pipeline-health.yaml |
| OpenSearch Cluster Health | 10 | files/dashboard-opensearch-health.yaml |
Adding a new dashboard:
- Create `files/dashboard-my-thing.yaml` (see existing files for format)
- Add it to `templates/init-dashboards-configmap.yaml`
- Add one line to `main()` in `files/init-opensearch-dashboards.py`: `create_promql_dashboard_from_yaml(workspace_id, "/config/dashboard-my-thing.yaml")`
Dashboard YAML format:
dashboard:
  id: my-dashboard-id
  title: My Dashboard
  description: What this monitors
panels:
  - id: panel-unique-id
    title: "Panel Title"
    query: "rate(some_metric_total[5m])"
    chartType: line
Syncing with docker-compose: The docker-compose init script and dashboard YAMLs (docker-compose/opensearch-dashboards/) are the source of truth. The Helm versions in files/ should be kept in sync. The only Helm-specific additions are the K8s Cluster Health dashboard (not applicable to docker-compose) and the BASE_URL env var override in the init script (line 11).
Metrics flow into Cortex via three paths:
- OTLP push from instrumented apps → OTel Collector → `prometheusremotewrite/cortex` exporter → Cortex `/api/v1/push`
- Self-scrape in the OTel Collector pulls its own `:8888` Prometheus endpoint and remote-writes to Cortex (`prometheus/self` receiver)
- Envoy scrape when the otel-demo overlay is enabled: pulls `frontend-proxy:10000/stats/prometheus` for ingress RED visibility (`prometheus/envoy` receiver)
resource_to_telemetry_conversion: true on the Cortex exporter promotes OTel resource attributes (service.name, service.version, ...) into Prometheus labels, so series carry service_name etc.
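To make the attribute-to-label mapping concrete, here is a minimal sketch of the renaming effect: characters that are not valid in a Prometheus label name become underscores. The function name `otel_attr_to_label` is hypothetical, and this only illustrates the resulting names; it is not the exporter's actual normalization code, which handles further cases.

```shell
# Illustrative only: map an OTel resource attribute name to the Prometheus
# label name it would appear under. Not the exporter's real implementation.
otel_attr_to_label() {
  # Replace every character outside [A-Za-z0-9_] with "_"
  printf '%s\n' "$1" | sed 's/[^A-Za-z0-9_]/_/g'
}

otel_attr_to_label service.name      # service_name
otel_attr_to_label service.version   # service_version
```

In PromQL, queries would then filter on the converted name, for example `service_name`.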
Cortex's own ruler evaluates rule groups loaded from files/rules-stack/ (always) and files/rules-otel-demo/ (when otel-demo is enabled), and routes firing alerts to the in-cluster Alertmanager, which webhooks them into the OpenSearch alertmanager-alerts index for history + search.
The default values deploy a 3-node OpenSearch cluster suitable for small production workloads. For local development, override to single-node (see Install). For enterprise-scale deployments, adjust the following knobs.
| Knob | Default | Production Guidance |
|---|---|---|
| `opensearch.replicas` | `3` | 3+ data nodes minimum for HA |
| `opensearch.singleNode` | `false` | Set `true` only for local dev (kind) |
| `opensearch.resources.requests.memory` | `2Gi` | 8–64Gi per node (JVM gets 50%) |
| `opensearch.persistence.size` | `8Gi` | Size per formula below |
| `opensearch.extraEnvs[OPENSEARCH_JAVA_OPTS]` | `-Xms1g -Xmx1g` | 50% of node RAM, max 31g |
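The heap guidance in the table (50% of node RAM, capped at 31g) can be sketched as a small helper. The function name `jvm_heap_opts` is hypothetical, and it assumes node memory is given in whole GiB.

```shell
# Derive JVM heap flags from node memory: half of RAM, never above 31g.
jvm_heap_opts() {
  # $1 = node memory in GiB (whole number)
  heap=$(( $1 / 2 ))
  if [ "$heap" -gt 31 ]; then
    heap=31
  fi
  echo "-Xms${heap}g -Xmx${heap}g"
}

jvm_heap_opts 2    # -Xms1g -Xmx1g  (matches the chart default for 2Gi)
jvm_heap_opts 64   # -Xms31g -Xmx31g
```

The output is in the form expected by `OPENSEARCH_JAVA_OPTS`.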
Storage formula:
storage_per_node = (daily_ingest_GB × 1.45 × (replicas + 1) × retention_days) / node_count
The 1.45x multiplier accounts for indexing overhead (10%), OS reserved space for merges (20%), filesystem overhead (5%), and node failure buffer (10%).
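As a worked instance of the formula, the sketch below evaluates it with awk. The function name `storage_per_node_gb` and the input figures (10 GB/day, 1 replica, 10 days of retention, 3 nodes) are illustrative assumptions, not chart defaults.

```shell
# storage_per_node = (daily_ingest_GB * 1.45 * (replicas + 1) * retention_days) / node_count
storage_per_node_gb() {
  # $1 = daily ingest (GB), $2 = replicas, $3 = retention (days), $4 = node count
  awk -v ingest="$1" -v replicas="$2" -v days="$3" -v nodes="$4" \
    'BEGIN { printf "%.1f\n", (ingest * 1.45 * (replicas + 1) * days) / nodes }'
}

storage_per_node_gb 10 1 10 3   # 96.7
```

Under these assumed inputs each node needs roughly 97 GB, so `opensearch.persistence.size` would be set at or above that figure.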
Shard sizing:
- Logs/traces (write-heavy): 30–50 GB per primary shard
- Search (latency-sensitive): 10–30 GB per primary shard
- Total shards should be a multiple of data node count
- Max 25 shards per GB of JVM heap
Shard count is configurable per Data Prepper pipeline sink via number_of_shards and number_of_replicas (commented out in values.yaml).
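The shard guidelines above reduce to simple integer arithmetic. The sketch below is one way to apply them; the function names and the 120 GB example are hypothetical.

```shell
# Primary shards needed so that no shard exceeds the target size
# (integer ceiling division).
primary_shards() {
  # $1 = primary data size in GB, $2 = target GB per shard
  echo $(( ($1 + $2 - 1) / $2 ))
}

# Upper bound on shards for a node, from "max 25 shards per GB of JVM heap".
max_shards_for_heap() {
  # $1 = JVM heap in GB
  echo $(( 25 * $1 ))
}

primary_shards 120 40     # 3
max_shards_for_heap 1     # 25
```

With the default 1g heap, a node should therefore hold no more than 25 shards; the resulting shard count would go into `number_of_shards` on the relevant sink.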
| Knob | Default | Description |
|---|---|---|
| `data-prepper.pipelineConfig.config.otel-logs-pipeline.workers` | `5` | Parallel log processing threads |
| `...opensearch.number_of_shards` | (OS default: 1) | Primary shards per index |
| `...opensearch.number_of_replicas` | (OS default: 1) | Replica shards per primary |
| `...opensearch.bulk_size` | `5` (MiB) | Bulk request size to OpenSearch |
| Knob | Default | Description |
|---|---|---|
| `opentelemetry-collector.resources.requests.cpu` | `256m` | CPU request |
| `opentelemetry-collector.resources.requests.memory` | `512Mi` | Memory request |
| `opentelemetry-collector.resources.limits.cpu` | `1` | CPU limit |
| `opentelemetry-collector.resources.limits.memory` | `2Gi` | Memory limit |
The collector's memory_limiter processor (80% limit, 25% spike) provides backpressure before the OOM kill threshold.
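To see what those percentages mean in absolute terms, the sketch below converts them to MiB. It assumes the percentages apply to the container memory limit (2Gi by default) and that the soft limit is the hard limit minus the spike allowance, as the memory_limiter processor documents; the function name `memory_limiter_mib` is hypothetical.

```shell
# Convert memory_limiter percentages into absolute thresholds.
memory_limiter_mib() {
  # $1 = total memory in MiB, $2 = limit percentage, $3 = spike percentage
  hard=$(( $1 * $2 / 100 ))
  spike=$(( $1 * $3 / 100 ))
  echo "hard=${hard}MiB soft=$(( hard - spike ))MiB"
}

memory_limiter_mib 2048 80 25   # hard=1638MiB soft=1126MiB
```

Under these assumptions the collector would start refusing data at about 1126 MiB, well before the 2Gi limit at which the container is OOM-killed.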
| Knob | Default | Description |
|---|---|---|
| `cortex.retention` | `15d` | Block-retention period for the compactor (older blocks are deleted) |
| `cortex.persistence.enabled` | `true` | Persist blocks/TSDB to a PVC |
| `cortex.persistence.size` | `50Gi` | PVC size for /data (blocks + ruler-storage) |
| `cortex.resources.requests.memory` | `500Mi` | Base memory request; bump for production cardinality (EKS overlay sets 2Gi) |
| `cortex.resources.limits.memory` | `1Gi` | Memory cap; bump for production (EKS overlay sets 4Gi) |
| Knob | Default | Description |
|---|---|---|
| `alertmanager.persistence.size` | `5Gi` | PVC for silence/notification state |
| `alertmanager.resources.requests.memory` | `64Mi` | Base memory request |
| `alertmanager.resources.limits.memory` | `128Mi` | Memory cap |
| Profile | OS Nodes | OS Memory | OS Disk | OTel Collector Memory | Cortex Retention |
|---|---|---|---|---|---|
| Dev/Demo (default) | 3 | 2Gi | 8Gi | 2Gi | 15d |
| Small team (~10 GB/day) | 3 | 8Gi | 100Gi | 2Gi | 30d |
| Enterprise (~100 GB/day) | 6+ | 32Gi | 500Gi+ | 4Gi+ | 90d |
Sources: OpenSearch shard sizing, AWS sizing guide, AWS shard best practices
For advanced cluster topologies (dedicated cluster manager nodes, coordinating nodes, hot-warm-cold architecture):
- Tuning your cluster — official guide covering node roles, dedicated nodes, shard allocation, and production recommendations
- Setup multi-node cluster on Kubernetes using Helm — walkthrough for dedicated cluster manager, data, and coordinating nodes with the official Helm chart
- Sizing domains — AWS sizing calculator and methodology (applicable to self-managed clusters)
This Helm chart mirrors the Docker Compose configuration for feature parity. See SYNC.md for the current sync checkpoint, what stays in sync vs. what is intentionally different, and the full SOP for detecting and fixing drift.
See values.yaml for all options. Notable settings:
# Credentials (update opensearchPassword before any real deployment)
opensearchUsername: "admin"
opensearchPassword: "My_password_123!@#"
# Data Prepper metrics port (must be in ports list for OTel Collector to scrape)
data-prepper:
  ports:
    - name: metrics
      port: 4900
# Cortex (single-binary Prometheus replacement) sizing
cortex:
  retention: "15d"
  persistence:
    size: 50Gi
  resources:
    requests:
      memory: "500Mi"
    limits:
      memory: "1Gi"
# Alertmanager fan-out
alertmanager:
  enabled: true
# OpenSearch Service hostname used by alertmanager + init Jobs (override
# this if you set opensearch.fullnameOverride or otherwise rewire it).
opensearchServiceName: "opensearch-cluster-master"
The OpenTelemetry Demo is available as an optional subchart. It deploys a full microservices e-commerce app (20+ services) that generates realistic telemetry, which is useful for load testing and showcasing the stack.
Disabled by default (~2GB additional memory required).
Enable:
helm upgrade obs-stack . -n observability-stack \
--set opentelemetry-demo.enabled=true --no-hooks
Disable:
helm upgrade obs-stack . -n observability-stack --no-hooks
All bundled backends (Jaeger, Grafana, Prometheus, OpenSearch) in the demo chart are disabled. Demo services send telemetry to our OTel Collector, which fans metrics to Cortex and traces/logs to Data Prepper → OpenSearch. No duplicate infrastructure.