Commit 6a3d099

fix(helm): sync chart with compose, Cortex dashboard queries, external-dns safety (#255)
* fix(helm): sync init-opensearch-dashboards.py with Docker Compose (#254)

Ports the ndjson index-pattern dedup fix and delayed field refresh from the compose init script into the helm chart version:

- Skip index-pattern objects from ndjson import to prevent duplicates
- Rewrite ndjson references to point at live index-pattern IDs
- Refresh field lists for logs/traces/service-map patterns after 10min
- Bump opentelemetry-demo subchart 0.40.5 → 0.40.8 (appVersion 2.2.0)

Syncs with compose commit 6b8b2c8.

Signed-off-by: ps48 <pshenoy36@gmail.com>

* fix: Cortex dashboard queries, job name length, OSD flags, external-dns safety

- dashboard-pipeline-health: replace Prometheus TSDB queries with Cortex equivalents (ingestion rate, active series, query latency, memory series); remove WAL/Head Chunks panels (no Cortex equivalent)
- otel-demo-alerting-rules-monitors-init-job: shorten resource names to fix K8s 63-char label limit (otel-demo-alerts-init)
- values-anonymous-auth: add enableIconSideNav, alertManager.enabled, slo.enabled flags
- addons.tf: scope external-dns domainFilters to var.domain (not entire zone) and use upsert-only policy to prevent production DNS overwrites

Signed-off-by: ps48 <pshenoy36@gmail.com>

---------

Signed-off-by: ps48 <pshenoy36@gmail.com>
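The first two bullets of the first commit (skip index-pattern objects, rewrite references) can be sketched in isolation. This is a minimal sketch rather than the chart's script: the helper name `filter_and_remap` and the sample IDs are hypothetical, and it covers only parsing, filtering, and reference rewriting, not the import API calls.

```python
import json


def filter_and_remap(ndjson_text, id_mappings=None):
    """Return saved objects from an ndjson export, minus index patterns.

    Hypothetical helper for illustration. References whose id appears in
    id_mappings are rewritten to the mapped (live) id.
    """
    id_mappings = id_mappings or {}
    kept = []
    for line in ndjson_text.splitlines():
        line = line.strip()
        if not line:
            continue
        obj = json.loads(line)
        # The export summary line has exportedCount but no type.
        if "exportedCount" in obj and "type" not in obj:
            continue
        # Index patterns are assumed to exist already; importing them again
        # would create duplicates under different IDs.
        if obj.get("type") == "index-pattern":
            continue
        for ref in obj.get("references", []):
            if ref.get("id") in id_mappings:
                ref["id"] = id_mappings[ref["id"]]
        kept.append(obj)
    return kept


# Made-up export: one index pattern, one visualization, one summary line.
sample = "\n".join([
    json.dumps({"type": "index-pattern", "id": "old-logs"}),
    json.dumps({"type": "visualization", "id": "viz-1",
                "references": [{"type": "index-pattern", "id": "old-logs"}]}),
    json.dumps({"exportedCount": 2}),
])
objects = filter_and_remap(sample, {"old-logs": "live-logs"})
```

Only the visualization survives, and its reference points at the mapped ID, which is the behavior the bullets describe.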
1 parent e14be74 commit 6a3d099

15 files changed

Lines changed: 285 additions & 70 deletions

.env

Lines changed: 4 additions & 1 deletion
@@ -11,7 +11,10 @@ INCLUDE_COMPOSE_EXAMPLES=docker-compose.examples.yml
 INCLUDE_COMPOSE_LOCAL_OPENSEARCH=docker-compose.local-opensearch.yml
 INCLUDE_COMPOSE_LOCAL_OPENSEARCH_DASHBOARDS=docker-compose.local-opensearch-dashboards.yml
 
-OPENSEARCH_DOCKER_REPO=opensearchstaging
+# TODO: Change to opensearchproject after 3.7.0 official release
+# OPENSEARCH_DOCKER_REPO=opensearchstaging
+OPENSEARCH_IMAGE=ashisagr32966/opensearch-sql-main:3.7.0-sql-main
+OPENSEARCH_DASHBOARDS_IMAGE=joshuali925/opensearch-dashboards:3.7.0-slos-pr-against-main-v2
 
 
 # OpenSearch Configuration

charts/observability-stack/Chart.yaml

Lines changed: 1 addition & 1 deletion
@@ -52,6 +52,6 @@ dependencies:
     condition: opentelemetry-collector.enabled
 
   - name: opentelemetry-demo
-    version: "0.40.5"
+    version: "0.40.8"
     repository: "https://open-telemetry.github.io/opentelemetry-helm-charts"
     condition: opentelemetry-demo.enabled

charts/observability-stack/SYNC.md

Lines changed: 3 additions & 3 deletions
@@ -6,10 +6,10 @@ This Helm chart mirrors the Docker Compose configuration for feature parity acro
 
 | Field | Value |
 |-------|-------|
-| **Last synced commit** | `a584256` |
-| **Commit message** | `add insecure: true flag to connect to cortex/prometheus (#249)` |
+| **Last synced commit** | `6b8b2c8` |
+| **Commit message** | `fix: deduplicate index patterns in dashboards init and pin otel-demo to 2.2.0 (#254)` |
 | **Date** | 2026-05-19 |
-| **Synced by** | @ashisagr |
+| **Synced by** | @ps48 |
 
 ## What Stays in Sync
 

charts/observability-stack/files/dashboard-pipeline-health.yaml

Lines changed: 14 additions & 24 deletions
@@ -50,36 +50,26 @@ panels:
     query: "otelcol_processor_batch_metadata_cardinality"
     chartType: line
 
-  # --- Row 5: Prometheus Health ---
-  - id: pipeline-prometheus-ingestion
-    title: "Prometheus Ingestion Rate (chunks/sec)"
-    query: "rate(prometheus_tsdb_head_chunks_created_total[5m])"
+  # --- Row 5: Cortex Health ---
+  - id: pipeline-cortex-ingestion-rate
+    title: "Cortex Ingestion Rate (samples/sec)"
+    query: "avg(cortex_ingester_ingestion_rate_samples_per_second)"
     chartType: line
 
-  - id: pipeline-prometheus-active-series
-    title: "Prometheus Active Time Series"
-    query: "prometheus_tsdb_head_series"
+  - id: pipeline-cortex-active-series
+    title: "Cortex Active Time Series"
+    query: "cortex_ingester_active_series"
     chartType: line
 
-  # --- Row 6: Prometheus Storage ---
-  - id: pipeline-prometheus-wal-size
-    title: "Prometheus WAL Size (bytes)"
-    query: "prometheus_tsdb_wal_storage_size_bytes"
+  # --- Row 6: Cortex Query Performance ---
+  - id: pipeline-cortex-query-latency
+    title: "Cortex Query Latency P99 (sec)"
+    query: "histogram_quantile(0.99, rate(cortex_request_duration_seconds_bucket{route=~\"prometheus_api_v1_query\"}[5m]))"
     chartType: line
 
-  - id: pipeline-prometheus-head-chunks
-    title: "Prometheus Head Chunks Size (bytes)"
-    query: "prometheus_tsdb_head_chunks_storage_size_bytes"
-    chartType: line
-
-  - id: pipeline-prometheus-query-latency
-    title: "Prometheus Query Latency P99 (sec)"
-    query: "histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket{handler=\"/api/v1/query\"}[5m]))"
-    chartType: line
-
-  - id: pipeline-prometheus-range-query-latency
-    title: "Prometheus Range Query Latency P99 (sec)"
-    query: "histogram_quantile(0.99, rate(prometheus_http_request_duration_seconds_bucket{handler=\"/api/v1/query_range\"}[5m]))"
+  - id: pipeline-cortex-range-query-latency
+    title: "Cortex Range Query Latency P99 (sec)"
+    query: "histogram_quantile(0.99, rate(cortex_request_duration_seconds_bucket{route=~\"prometheus_api_v1_query_range\"}[5m]))"
     chartType: line
 
   # --- Row 7: Data Prepper — Logs Pipeline ---
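For readers unfamiliar with what the two P99 latency panels compute, the following is a simplified sketch of the bucket interpolation that `histogram_quantile` performs. It is an illustration, not Prometheus or Cortex source code. It assumes non-negative bucket bounds, at least one finite bucket before `+Inf`, and cumulative counts. The bucket values are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Simplified sketch: assumes 0 <= q <= 1, buckets sorted by upper bound,
    the last bucket is +Inf, and all finite bounds are non-negative.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: report the highest
                # finite upper bound instead.
                return buckets[-2][0]
            if count == prev_count:
                return bound
            # Linear interpolation inside the matching bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Made-up cumulative counts for le=0.1, 0.5, 1.0, +Inf.
p99 = histogram_quantile(0.99, [(0.1, 10.0), (0.5, 90.0), (1.0, 100.0), (math.inf, 100.0)])
overflow = histogram_quantile(0.5, [(0.1, 0.0), (math.inf, 5.0)])
```

With these sample buckets the 99th percentile lands inside the 0.5 to 1.0 second bucket, so the panel value depends on how finely the buckets are spaced around the latencies of interest.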

charts/observability-stack/files/init-opensearch-dashboards.py

Lines changed: 109 additions & 3 deletions
@@ -1700,13 +1700,17 @@ def _create_saved_object_directly(workspace_id, obj):
         return False
 
 
-def import_ndjson_dashboard(workspace_id, ndjson_path):
+def import_ndjson_dashboard(workspace_id, ndjson_path, id_mappings=None):
     """Import a dashboard and its dependencies from an ndjson export file.
 
     Objects that reference virtual index patterns (e.g. Prometheus datasource)
     are separated out and created individually via the saved-objects API, which
     does not validate references. The remaining objects are bulk-imported via the
     _import API.
+
+    id_mappings: optional dict of {old_id: new_id} to rewrite references at
+    import time. Used to point exported objects at the live index-pattern IDs
+    instead of the hardcoded ones from the export environment.
     """
     import json
     import io
@@ -1734,11 +1738,22 @@ def import_ndjson_dashboard(workspace_id, ndjson_path):
         # Skip the export summary line (has exportedCount but no type)
         if "exportedCount" in obj and "type" not in obj:
             continue
+        # Skip index-pattern objects — they are created earlier by the init
+        # script with the correct workspace/datasource associations. Importing
+        # them again from the ndjson would create duplicates with different IDs.
+        if obj.get("type") == "index-pattern":
+            continue
         # Remove workspace associations so objects land in the target workspace
         obj.pop("workspaces", None)
         # Remove version field that can cause conflicts on import
         obj.pop("version", None)
 
+        # Rewrite references to point at the live index-pattern IDs
+        if id_mappings:
+            for ref in obj.get("references", []):
+                if ref.get("id") in id_mappings:
+                    ref["id"] = id_mappings[ref["id"]]
+
         if _has_virtual_reference(obj):
             direct_create.append(obj)
         else:
@@ -1847,8 +1862,15 @@ def main():
     prometheus_datasource_id = create_prometheus_datasource(workspace_id)
     create_opensearch_datasource(workspace_id)
 
-    # Import Astronomy Shop dashboard (ndjson export with all dependencies)
-    import_ndjson_dashboard(workspace_id, "/config/dashboard-astronomy-shop.ndjson")
+    # Import Astronomy Shop dashboard (ndjson export with all dependencies).
+    # Map the hardcoded index-pattern IDs from the export to the live IDs
+    # created above, so dashboard panels reference the correct datasets.
+    ndjson_id_mappings = {}
+    if logs_pattern_id:
+        ndjson_id_mappings["545c7990-2938-11f1-84ad-e734b5ac5a91"] = logs_pattern_id
+    if traces_pattern_id:
+        ndjson_id_mappings["54f4c1f0-2938-11f1-84ad-e734b5ac5a91"] = traces_pattern_id
+    import_ndjson_dashboard(workspace_id, "/config/dashboard-astronomy-shop.ndjson", ndjson_id_mappings)
 
     # Create APM config correlation (ties traces + service map + Prometheus)
     if traces_pattern_id and service_map_pattern_id:
@@ -1872,5 +1894,89 @@ def main():
     print(f"📈 Prometheus: http://localhost:{PROMETHEUS_PORT}")
     print()
 
+def refresh_index_pattern_fields(workspace_id, pattern_id, title):
+    """Refresh the fields list for an index pattern by querying OpenSearch mappings.
+
+    OSD populates fields lazily on first Discover/Explore visit. This function
+    triggers the same refresh via the API so field lists are available immediately.
+    """
+    if not pattern_id:
+        return False
+
+    if workspace_id and workspace_id != "default":
+        url = f"{BASE_URL}/w/{workspace_id}/api/index_patterns/_fields_for_wildcard?pattern={title}&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score"
+    else:
+        url = f"{BASE_URL}/api/index_patterns/_fields_for_wildcard?pattern={title}&meta_fields=_source&meta_fields=_id&meta_fields=_type&meta_fields=_index&meta_fields=_score"
+
+    try:
+        resp = requests.get(
+            url, auth=(USERNAME, PASSWORD),
+            headers={"Content-Type": "application/json", "osd-xsrf": "true"},
+            verify=False, timeout=30,
+        )
+        if resp.status_code != 200:
+            print(f" ⚠️ Failed to fetch fields for {title}: {resp.status_code}")
+            return False
+
+        fields = resp.json().get("fields", [])
+        if not fields:
+            print(f" ⏭️ No fields found for {title} (index may be empty)")
+            return False
+
+        import json
+        fields_json = json.dumps(fields)
+
+        if workspace_id and workspace_id != "default":
+            put_url = f"{BASE_URL}/w/{workspace_id}/api/saved_objects/index-pattern/{pattern_id}"
+        else:
+            put_url = f"{BASE_URL}/api/saved_objects/index-pattern/{pattern_id}"
+
+        put_resp = requests.put(
+            put_url, auth=(USERNAME, PASSWORD),
+            headers={"Content-Type": "application/json", "osd-xsrf": "true"},
+            json={"attributes": {"fields": fields_json}},
+            verify=False, timeout=10,
+        )
+        if put_resp.status_code == 200:
+            print(f" ✅ Refreshed fields for {title} ({len(fields)} fields)")
+            return True
+        else:
+            print(f" ⚠️ Failed to update fields for {title}: {put_resp.status_code}")
+            return False
+    except requests.exceptions.RequestException as e:
+        print(f" ⚠️ Error refreshing fields for {title}: {e}")
+        return False
+
+
+def delayed_field_refresh(workspace_id, patterns):
+    """Wait for data to land in indices, then refresh field lists.
+
+    Called after the main init completes. Waits 10 minutes for the otel-demo
+    and agent examples to populate indices with representative documents so
+    the field refresh picks up all mapped fields.
+    """
+    delay_minutes = 10
+    print(f"\n⏳ Waiting {delay_minutes} minutes for indices to populate before refreshing fields...")
+    time.sleep(delay_minutes * 60)
+
+    print("🔄 Refreshing index pattern field lists...")
+    for pattern_id, title in patterns:
+        refresh_index_pattern_fields(workspace_id, pattern_id, title)
+    print("✅ Field refresh complete")
+
+
 if __name__ == "__main__":
     main()
+
+    # Re-read workspace and pattern IDs for the delayed refresh.
+    workspace_id = get_existing_workspace()
+    logs_id = get_existing_index_pattern(workspace_id, "logs-otel-v1*")
+    traces_id = get_existing_index_pattern(workspace_id, "otel-v1-apm-span*")
+    svc_map_id = get_existing_index_pattern(workspace_id, "otel-v2-apm-service-map*")
+
+    patterns = [
+        (logs_id, "logs-otel-v1*"),
+        (traces_id, "otel-v1-apm-span*"),
+        (svc_map_id, "otel-v2-apm-service-map*"),
+    ]
+    delayed_field_refresh(workspace_id, patterns)
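`refresh_index_pattern_fields` interpolates the pattern title, which ends in `*`, directly into the query string. Nothing in the commit suggests that causes a problem. As a possible alternative, the URL could be built with `urllib.parse.urlencode` so that the title is percent-encoded and the two near-identical URL literals collapse into one. This is a sketch under assumptions: the helper name and the `http://localhost:5601` base URL are hypothetical, and whether the server treats the encoded and unencoded forms identically has not been verified here.

```python
from urllib.parse import urlencode

# Meta fields taken from the query string in the diff above.
META_FIELDS = ["_source", "_id", "_type", "_index", "_score"]


def fields_for_wildcard_url(base_url, workspace_id, title):
    """Build the _fields_for_wildcard URL with an encoded query string.

    Hypothetical helper; mirrors the workspace/default branching in the diff.
    """
    if workspace_id and workspace_id != "default":
        prefix = f"{base_url}/w/{workspace_id}"
    else:
        prefix = base_url
    # A list of pairs lets the meta_fields key repeat, as in the original URL.
    params = [("pattern", title)] + [("meta_fields", name) for name in META_FIELDS]
    return f"{prefix}/api/index_patterns/_fields_for_wildcard?{urlencode(params)}"


# Base URL and workspace ID are placeholders for illustration.
url = fields_for_wildcard_url("http://localhost:5601", "abc123", "logs-otel-v1*")
default_url = fields_for_wildcard_url("http://localhost:5601", "default", "otel-v1-apm-span*")
```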

charts/observability-stack/templates/otel-collector-configmap.yaml

Lines changed: 66 additions & 1 deletion
@@ -54,6 +54,71 @@ data:
               relabel_configs:
                 - target_label: service.name
                   replacement: frontend-proxy
+      # OpenSearch exporter metrics for the OS Cluster Health dashboard.
+      prometheus/opensearch-exporter:
+        config:
+          scrape_configs:
+            - job_name: opensearch-exporter
+              scrape_interval: 15s
+              static_configs:
+                - targets: ["{{ include "observability-stack.fullname" . }}-opensearch-exporter:9114"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: opensearch-exporter
+      # Cortex internal metrics for the Pipeline Health dashboard.
+      prometheus/cortex:
+        config:
+          scrape_configs:
+            - job_name: cortex
+              scrape_interval: 15s
+              metrics_path: /metrics
+              static_configs:
+                - targets: ["{{ .Release.Name }}-prometheus-server:80"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: cortex
+      # Data Prepper pipeline metrics for the Pipeline Health dashboard.
+      prometheus/data-prepper:
+        config:
+          scrape_configs:
+            - job_name: data-prepper
+              scrape_interval: 15s
+              metrics_path: /metrics/sys
+              static_configs:
+                - targets: ["{{ .Release.Name }}-data-prepper:4900"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: data-prepper
+            - job_name: data-prepper-pipelines
+              scrape_interval: 15s
+              metrics_path: /metrics/prometheus
+              static_configs:
+                - targets: ["{{ .Release.Name }}-data-prepper:4900"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: data-prepper
+      # Kubernetes metrics — harmless no-op if kube-state-metrics / node-exporter
+      # are not installed (DNS NXDOMAIN → scrape drops silently).
+      prometheus/kube-state-metrics:
+        config:
+          scrape_configs:
+            - job_name: kube-state-metrics
+              scrape_interval: 15s
+              static_configs:
+                - targets: ["kube-state-metrics.kube-system.svc.cluster.local:8080"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: kube-state-metrics
+      prometheus/node-exporter:
+        config:
+          scrape_configs:
+            - job_name: node-exporter
+              scrape_interval: 15s
+              static_configs:
+                - targets: ["node-exporter-prometheus-node-exporter.kube-system.svc.cluster.local:9100"]
+              relabel_configs:
+                - target_label: service.name
+                  replacement: node-exporter
     processors:
       memory_limiter:
         check_interval: 5s
@@ -143,7 +208,7 @@ data:
           processors: [resourcedetection, memory_limiter, transform, batch]
           exporters: [otlp/opensearch, debug]
         metrics:
-          receivers: [otlp, prometheus/self, prometheus/envoy]
+          receivers: [otlp, prometheus/self, prometheus/envoy, prometheus/opensearch-exporter, prometheus/cortex, prometheus/data-prepper, prometheus/kube-state-metrics, prometheus/node-exporter]
           processors: [resourcedetection, memory_limiter, batch]
           exporters: [prometheusremotewrite/cortex, debug]
         logs:

charts/observability-stack/templates/otel-demo-alerting-rules-monitors-init-job.yaml

Lines changed: 6 additions & 6 deletions
@@ -4,15 +4,15 @@
 # 1. Cortex ruler rules from /rules/otel_demo
 # 2. OpenSearch alerting monitors for OTel Demo traces/logs
 # (checkout, payment, cart, frontend RED signals)
-# Mirrors the `otel-demo-alerting-rules-monitors-init` container in
+# Mirrors the `otel-demo-alerts-init` container in
 # docker-compose.otel-demo.yml. Named separately from the base Job (rather
 # than overlaying its volume mounts) because Helm doesn't support
 # subchart-style service overlays — and the upstream API is idempotent
 # regardless of how the rules are partitioned across containers.
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: {{ include "observability-stack.fullname" . }}-otel-demo-alerting-rules-monitors-init-script
+  name: {{ include "observability-stack.fullname" . }}-otel-demo-alerts-init-script
   labels:
     {{- include "observability-stack.labels" . | nindent 4 }}
     app.kubernetes.io/component: init
@@ -29,7 +29,7 @@ data:
 apiVersion: batch/v1
 kind: Job
 metadata:
-  name: {{ include "observability-stack.fullname" . }}-otel-demo-alerting-rules-monitors-init
+  name: {{ include "observability-stack.fullname" . }}-otel-demo-alerts-init
   labels:
     {{- include "observability-stack.labels" . | nindent 4 }}
     app.kubernetes.io/component: init
@@ -42,11 +42,11 @@ spec:
   template:
     metadata:
       labels:
-        app.kubernetes.io/name: otel-demo-alerting-rules-monitors-init
+        app.kubernetes.io/name: otel-demo-alerts-init
     spec:
       restartPolicy: OnFailure
       containers:
-        - name: otel-demo-alerting-rules-monitors-init
+        - name: otel-demo-alerts-init
           image: python:3.11-alpine
           command:
             - /bin/sh
@@ -82,7 +82,7 @@ spec:
       volumes:
         - name: script
           configMap:
-            name: {{ include "observability-stack.fullname" . }}-otel-demo-alerting-rules-monitors-init-script
+            name: {{ include "observability-stack.fullname" . }}-otel-demo-alerts-init-script
         - name: rules-otel-demo
           configMap:
             name: {{ include "observability-stack.fullname" . }}-cortex-rules-otel-demo
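The commit message attributes this rename to the Kubernetes 63-character label limit. A quick length check shows why the longer suffix is fragile. This is a sketch under assumptions: the rendered fullname `my-release-observability-stack` is hypothetical, and the check assumes the full Job name ends up in a label value (Kubernetes copies the Job name into a pod label, which is presumably what triggered the failure).

```python
K8S_LABEL_VALUE_MAX = 63  # Kubernetes label values are limited to 63 characters


def fits_label_limit(fullname, suffix):
    """Return True when "<fullname>-<suffix>" fits in a label value."""
    return len(f"{fullname}-{suffix}") <= K8S_LABEL_VALUE_MAX


fullname = "my-release-observability-stack"  # hypothetical rendered fullname
old_ok = fits_label_limit(fullname, "otel-demo-alerting-rules-monitors-init")
new_ok = fits_label_limit(fullname, "otel-demo-alerts-init")
```

The shorter suffix leaves more headroom for long release names, although a sufficiently long fullname could still exceed the limit.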

charts/observability-stack/tests/otel_collector_configmap_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ tests:
     asserts:
       - matchRegex:
           path: data.relay
-          pattern: "receivers:\\s*\\[otlp,\\s*prometheus/self,\\s*prometheus/envoy\\]"
+          pattern: "receivers:\\s*\\[otlp,\\s*prometheus/self,\\s*prometheus/envoy,\\s*prometheus/opensearch-exporter,\\s*prometheus/cortex,\\s*prometheus/data-prepper,\\s*prometheus/kube-state-metrics,\\s*prometheus/node-exporter\\]"
       - matchRegex:
           path: data.relay
           pattern: "exporters:\\s*\\[prometheusremotewrite/cortex,\\s*debug\\]"
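The updated `matchRegex` pattern can be sanity-checked outside the chart's test runner. The sketch below uses Python's `re` module as a stand-in. The chart's test framework may use a different regex engine, but this pattern relies only on `\s`, `*`, and escaped brackets, which behave the same in common engines. The two sample lines are copied from the `metrics` pipeline in the configmap diff.

```python
import re

# Same expression as the YAML string above, with YAML's doubled backslashes
# reduced to single ones.
RECEIVERS_PATTERN = (
    r"receivers:\s*\[otlp,\s*prometheus/self,\s*prometheus/envoy,"
    r"\s*prometheus/opensearch-exporter,\s*prometheus/cortex,"
    r"\s*prometheus/data-prepper,\s*prometheus/kube-state-metrics,"
    r"\s*prometheus/node-exporter\]"
)

# The receivers line after this commit.
rendered_line = (
    "receivers: [otlp, prometheus/self, prometheus/envoy, "
    "prometheus/opensearch-exporter, prometheus/cortex, prometheus/data-prepper, "
    "prometheus/kube-state-metrics, prometheus/node-exporter]"
)
# The receivers line before this commit.
old_line = "receivers: [otlp, prometheus/self, prometheus/envoy]"

matches_new = re.search(RECEIVERS_PATTERN, rendered_line) is not None
matches_old = re.search(RECEIVERS_PATTERN, old_line) is not None
```

Because the pattern spells out the full receiver list in order, adding or reordering a receiver in the template requires updating this assertion as well.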

charts/observability-stack/values-anonymous-auth.yaml

Lines changed: 3 additions & 0 deletions
@@ -184,6 +184,9 @@ opensearch-dashboards:
       datasetManagement.enabled: true
       data.savedQueriesNewUI.enabled: true
       opensearchDashboards.branding.useExpandedHeader: false
+      opensearchDashboards.enableIconSideNav: true
+      observability.alertManager.enabled: true
+      observability.slo.enabled: true
       uiSettings.overrides.home:useNewHomePage: true
       uiSettings.overrides.query:enhancements:enabled: true
       uiSettings.overrides.explore:experimental: true
