
feat: add infrahub-observability chart #62

Draft
FragmentedPacket wants to merge 21 commits into stable from feat/infrahub-observability-chart

Conversation

@FragmentedPacket
Contributor

Summary

A new infrahub-observability Helm chart that bundles Alloy + Loki + Tempo + Prometheus + Grafana + the Prefect prometheus exporter for Kubernetes installs, designed to sit alongside the existing infrahub / infrahub-enterprise releases.

What's included

The chart

  • Vendors seven Grafana dashboards from opsmill/infrahub@infrahub-v1.9.3 via scripts/sync-dashboards.sh
  • Ships our own Alloy config.alloy (logs via loki.source.kubernetes, metrics via static + node-discovery scrapes, traces forwarded by OTLP gRPC)
  • Provisions Grafana datasources and dashboards via sidecar-watched ConfigMaps
  • Ships an in-chart Deployment + Service for prefecthq/prometheus-prefect-exporter (no upstream chart exists)
  • Scrapes kubelet /metrics/cadvisor via the API-server node-proxy for per-container CPU/memory/network/fs metrics (toggleable via alloy.cadvisor.enabled)
  • Post-install NOTES.txt with the wiring snippet for tracing

Cross-chart

  • Adds an opt-in global.tracing block to the infrahub chart so users can wire OTLP env vars onto server + task-worker with a single flag (values sketch below)
  • Emits the standard OTEL_EXPORTER_OTLP_INSECURE env var alongside INFRAHUB_TRACE_* because upstream infrahub's gRPC exporter init doesn't forward the insecure setting to the OTel SDK (otherwise the TLS handshake fails against plaintext collectors)
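
A minimal values sketch of what enabling this looks like. The key names under global.tracing are assumptions for illustration; only the env-var behaviour is confirmed by this PR.

```yaml
# Hypothetical shape of the opt-in block; key names are illustrative,
# not the chart's final schema.
global:
  tracing:
    enabled: true
    # OTLP gRPC endpoint of the collector, e.g. the Tempo shipped by
    # infrahub-observability (service name assumed)
    endpoint: "infrahub-observability-tempo:4317"
    insecure: true  # plaintext collector; also emits the OTEL_* vars
```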

Dashboard automation pipeline

  • scripts/transform_dashboard.py rewrites docker-compose labels to Kubernetes labels (container_label_com_docker_compose_service → container, etc.) on every sync. Idempotent and byte-stable.
  • scripts/validate_dashboards.py static-checks every dashboard's PromQL against an allowlist (scripts/known-metrics.yaml). Hard-fails on known-broken tokens, soft-warns on metrics we don't collect.
  • .github/workflows/dashboard-drift-check.yml runs the sync weekly and auto-opens a draft PR if upstream drifted.

CI

  • Three chart-scoped helm-lint jobs gated by dorny/paths-filter so PRs to one chart don't re-pull every chart's dependencies
  • New observability job also runs the dashboard validator

Test plan

  • make lint passes locally for all three charts
  • helm template test charts/infrahub-observability renders cleanly
  • All dashboard ConfigMaps stay under 1 MiB (largest is loki_monitoring at ~455 KB)
  • Manual install into a local Kubernetes cluster: Grafana reachable, all three datasources healthy, all seven dashboards visible, logs flowing into Loki (7 components), metrics flowing into Prometheus (8 scrape jobs reporting), traces from infrahub-server queryable in Tempo
  • cAdvisor scrape verified healthy via Alloy debug UI; up{job="cadvisor"} returns 1 (data not visible on OrbStack specifically; works on other distros — documented in docs/local-testing-observability.md)
  • Validator catches injected denied tokens (hard fail) and surfaces 18 meaningful soft warnings on the current dashboards
  • Transform script is idempotent (verified by repeated sync runs)

Followups

  • End-user installation docs for the observability stack in the opsmill/infrahub repo
  • Optional: kube-state-metrics integration for richer Prefect dashboard panels (currently warned-on)
  • Optional: enable Neo4j prometheus scrape (the chart's commented-out scrape block has the recipe)

FragmentedPacket and others added 21 commits May 12, 2026 11:54
Adds the skeleton for a new infrahub-observability Helm chart that
deploys the same observability stack Infrahub ships for local Docker
Compose development (Alloy + Loki + Tempo + Prometheus + Grafana +
Prefect exporter) onto Kubernetes alongside the infrahub /
infrahub-enterprise charts.

Includes Chart.yaml with subchart dependencies, base values.yaml,
shared helpers, vendored Grafana dashboards plus the sync script that
keeps them in lockstep with upstream infrahub releases, README
template, and Makefile targets for lint/template/sync. Templates that
actually wire up the components land in follow-up commits per the
plan in docs/plans/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Alloy subchart's auto-generated config doesn't know about our
Loki/Prometheus/Tempo service names or our Kubernetes log-discovery
needs, so we set its create=false and supply our own ConfigMap.

Adapts the upstream opsmill/infrahub@infrahub-v1.9.3 Alloy config to
Kubernetes: Docker discovery replaced with discovery.kubernetes role=pod,
log-pipeline selectors switched from container-name patterns to the
chart's component label, and scrape targets rewritten to use the
chart-rendered Kubernetes Service names (templated via the helpers in
_helpers.tpl).
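
For reference, a sketch of the values override this implies. Key paths follow the grafana/alloy subchart's layout as I understand it; verify against the pinned subchart version.

```yaml
# Parent-chart values: turn off the subchart's generated config so it
# mounts the ConfigMap this chart renders instead.
alloy:
  alloy:
    configMap:
      create: false
      # name intentionally left unset; the subchart then falls back to
      # <release>-alloy (see the helper note below)
```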

The Alloy subchart auto-resolves configMap.name to <release>-alloy when
empty, so we update the alloyConfigMapName helper to mirror that logic
exactly and leave the field unset.
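
A sketch of what that helper reduces to (helper name and values path assumed):

```yaml
{{/* Mirror the Alloy subchart's fallback: <release>-alloy when
     configMap.name is empty. */}}
{{- define "infrahub-observability.alloyConfigMapName" -}}
{{- default (printf "%s-alloy" .Release.Name) .Values.alloy.alloy.configMap.name -}}
{{- end -}}
```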

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts opsmill/infrahub@infrahub-v1.9.3 development/grafana/provisioning/
datasources/datasource.yml to Kubernetes by rewriting Docker hostnames
through the chart's URL helpers. UIDs (localprometheus / localloki /
localtempo) are kept so the vendored dashboards continue to resolve their
datasource references unchanged.

Fixes the tempoUrl helper port — the Tempo single-binary subchart serves
its HTTP query API on 3100, not 3200.
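
Each provisioned entry ends up shaped roughly like this (standard Grafana datasource-provisioning format; the URL helper name is assumed):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: localtempo  # kept from upstream so dashboard refs still resolve
    # rendered by the chart's tempoUrl helper, pointing at the
    # single-binary subchart's HTTP port 3100
    url: {{ include "infrahub-observability.tempoUrl" . }}
```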

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r ConfigMaps

Iterates Files.Glob over dashboards/*.json and emits one ConfigMap per
dashboard rather than bundling them — the seven vendored dashboards
already total ~900 KiB and loki_monitoring.json alone is ~390 KiB, so a
single ConfigMap would risk hitting etcd's 1 MiB-per-object limit. Each
ConfigMap carries the grafana_dashboard=1 label that Grafana's sidecar
watches for.
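
The template loop is roughly the following (fullname helper assumed; name truncation/validation elided):

```yaml
{{- range $path, $bytes := .Files.Glob "dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  # one object per dashboard keeps each well under etcd's 1 MiB cap
  name: {{ printf "%s-%s" (include "infrahub-observability.fullname" $) (base $path | trimSuffix ".json" | replace "_" "-") }}
  labels:
    grafana_dashboard: "1"  # the label Grafana's sidecar watches for
data:
  {{ base $path }}: |-
{{ $bytes | toString | indent 4 }}
{{- end }}
```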

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…annotations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing tips

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in global.tracing block on the infrahub chart that emits
INFRAHUB_TRACE_* env vars (matching upstream TraceSettings env_prefix) onto
the server and task-worker Deployments. Defaults to disabled so existing
users are unaffected; users of the new infrahub-observability chart can flip
it on with one flag and point at the chart's Tempo endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also regenerate infrahub README to document the new global.tracing block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster-agnostic walkthrough covering install of both charts, wiring the
new global.tracing block, and verifying logs/metrics/traces all reach
Grafana. Includes troubleshooting tips for the common failure modes
(missing StorageClass, sidecar not picking up ConfigMaps, OTLP receiver).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… args

Found during end-to-end install testing against a real cluster:

* Prefect: the prefect-server subchart's Service has the fixed name
  "prefect-server" (no release prefix), not "<release>-task-manager-server"
  as previously assumed. Both the Alloy scrape and the prefectApiUrl
  helper pointed at a non-existent host, causing the Prefect exporter to
  CrashLoop and the task-manager scrape to fail silently.
* RabbitMQ: the Bitnami rabbitmq chart exposes prometheus metrics on port
  9419 (named "metrics"), not the upstream rabbit-prom-plugin default of
  15692. Alloy's message-queue scrape target was hitting a closed port.
* Neo4j: the Neo4j chart does not expose prometheus metrics by default,
  so port 2004 is never open. Commented the scrape out with a note on
  what to enable upstream to bring it back.
* Tempo metricsGenerator: shipping with enabled=true and remoteWriteUrl=""
  causes Tempo to crash on startup with "url for remote_write is empty".
  Default to disabled with a note on how to turn it on.
* Prometheus extraArgs: web.enable-lifecycle and storage.tsdb.retention.time
  are already set by the subchart defaults, so passing them via extraArgs
  produces duplicate-flag errors. Use server.retention for retention and
  reserve extraArgs for non-default flags only.
* Alloy log-pipeline relabel: infrahub subchart pods (cache, database,
  message-queue, postgres) carry only `infrahub/service:` labels, not the
  top-level `service:` label. Added a fallback so their logs still get a
  `component` label in Loki.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ACE_*

Upstream infrahub's create_tracer_provider() (backend/infrahub/trace.py)
constructs the OTLP gRPC exporter without forwarding the `insecure`
setting it reads from INFRAHUB_TRACE_INSECURE. Without that, the OTel
gRPC client defaults to TLS and fails the handshake against a plaintext
OTLP collector (e.g. the Tempo shipped by infrahub-observability), with
errors like "SSL_ERROR_SSL ... WRONG_VERSION_NUMBER".

Workaround: also emit OTEL_EXPORTER_OTLP_INSECURE and
OTEL_EXPORTER_OTLP_TRACES_INSECURE so the OTel SDK itself honours the
setting even when infrahub's wrapper code forgets to pass it through.
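
Concretely, the rendered env block looks something like this (a sketch; the insecure-related names are the ones confirmed here, the rest of the tracing env surface is elided):

```yaml
env:
  - name: INFRAHUB_TRACE_INSECURE
    value: "true"
  # SDK-level workaround: make the OTel gRPC exporter itself use
  # plaintext, since infrahub's wrapper drops the insecure flag
  - name: OTEL_EXPORTER_OTLP_INSECURE
    value: "true"
  - name: OTEL_EXPORTER_OTLP_TRACES_INSECURE
    value: "true"
```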

Verified end-to-end against a real Kubernetes cluster: with the workaround
in place, traces from infrahub-server now reach Tempo and are queryable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y details

After end-to-end install testing, update the doc with:
* Actual pod-name list (postgres + prefect-server are present, rabbit pod
  is named -message-queue-0 not -rabbitmq-0)
* Note that Prometheus has no scrape targets of its own (Alloy pushes via
  remote-write); verify via metric existence not /targets
* New section "Service discovery and toggles" explaining how logs/metrics/
  traces are discovered, the label conventions (`service:` vs
  `infrahub/service:`), and the toggle surface
* Working trace-verification commands (the previous /api/storage/object
  curl returns 404 — replaced with GraphQL queries)
* Troubleshooting note on the OTEL_EXPORTER_OTLP_INSECURE TLS gotcha

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the four-section design we brainstormed: scrape kubelet cAdvisor
(with the nodes/proxy RBAC it needs), post-sync transform script that
rewrites Docker-compose labels to K8s labels, and a static query
allowlist validator wired into CI to catch future drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an Alloy scrape for the kubelet's /metrics/cadvisor endpoint (via the
API server's nodes/proxy path), giving us the per-container CPU, memory,
network, and filesystem metrics needed by the Container Resources and
Neo4j Monitoring dashboards.

Discovery uses role=node so the scrape covers every node in the cluster.
TLS uses the in-cluster CA + serviceaccount token. The Alloy subchart's
default ClusterRole already grants get on nodes/proxy — no new RBAC.

Toggleable via alloy.cadvisor.enabled (default true). Disable if your
cluster policy forbids nodes/proxy or if you have a separate cAdvisor
scrape via kube-prometheus-stack.
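
The toggle surface in values form (default per this commit; the comments summarize the mechanics described above):

```yaml
alloy:
  cadvisor:
    enabled: true  # default; set false if policy forbids nodes/proxy
    # When enabled, Alloy discovers nodes (role=node) and scrapes
    # /api/v1/nodes/<node>/proxy/metrics/cadvisor through the API
    # server, using the in-cluster CA and serviceaccount token.
```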

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…daptation

scripts/transform_dashboard.py rewrites vendored Grafana dashboards from
their docker-compose form to a Kubernetes-friendly form:
  - container_label_com_docker_compose_service -> container
  - container_label_com_docker_compose_project -> namespace
  - id!="" (cgroup-root guard) -> container!="", image!="" (k8s equivalent)
  - selector values: database -> neo4j, cache -> redis, message-queue ->
    rabbitmq, task-manager-db -> postgresql
  - legendFormat templates referencing the renamed labels

The transform is idempotent (re-running on already-transformed JSON is a
no-op) and byte-stable (it leaves the original bytes alone when no
content changes), which keeps PR diffs focused on real changes.

scripts/sync-dashboards.sh now pipes each fetched dashboard through the
transform before writing it to charts/infrahub-observability/dashboards/.
Upstream remains the only source of truth; we never edit the vendored
JSONs by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerated via scripts/sync-dashboards.sh — picks up the new
transform_dashboard.py pipeline. Only container_resources and
neo4j_monitoring change because they're the dashboards that use Docker-
specific labels. The other five files are byte-identical to upstream
since they don't reference docker-compose container labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/validate_dashboards.py walks every dashboard JSON, extracts the
PromQL expressions, and checks two things:

* Hard fail on any token in `denied_tokens` (e.g.
  `container_label_com_docker_compose_service`). These are tokens that
  should never appear in a committed dashboard — if they do, the
  transform pipeline failed or upstream introduced a new pattern we
  don't yet handle.

* Soft warn on metrics that aren't matched by any `collected_prefixes`
  glob in `scripts/known-metrics.yaml`. Surfaces "this panel won't have
  data without an extra scrape" cases — currently 18 such warnings,
  all genuine (Neo4j metrics not collected by default, kube-state-metrics
  recording rules referenced by Prefect dashboards, etc.).

Runs in <1s without a cluster. Designed for CI.
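
A sketch of the allowlist file's shape; the two top-level keys come from this commit, the entries are illustrative:

```yaml
# scripts/known-metrics.yaml (entries illustrative)
denied_tokens:
  # tokens that must never appear in a committed dashboard
  - container_label_com_docker_compose_service
  - container_label_com_docker_compose_project
collected_prefixes:
  # globs matching metrics we actually collect; anything else soft-warns
  - container_*   # kubelet cAdvisor
  - loki_*
  - tempo_*
  - prometheus_*
  - prefect_*
```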

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits the monolithic helm-lint job into three chart-scoped jobs gated by
dorny/paths-filter so a PR that only touches charts/infrahub/ no longer
re-pulls the six observability subchart dependencies. Enterprise lint
runs when either its own files or the infrahub chart it depends on
change.

The observability job now also runs scripts/validate_dashboards.py to
catch the class of breakage we found during manual testing (vendored
dashboards referencing labels we don't collect).
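
In outline (job and filter names assumed; enterprise lint keys off both its own paths and the infrahub chart's):

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      observability: ${{ steps.filter.outputs.observability }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            observability:
              - 'charts/infrahub-observability/**'
  lint-observability:
    needs: changes
    if: needs.changes.outputs.observability == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: helm lint charts/infrahub-observability
      - run: python scripts/validate_dashboards.py  # new validator step
```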

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New workflow .github/workflows/dashboard-drift-check.yml runs weekly (and
on workflow_dispatch with optional REF input). It re-runs sync-dashboards.sh
and, if upstream has changed the JSON at the recorded ref (or if our
transform pipeline drifted), opens a draft PR with the re-synced content
for human review.

Uses peter-evans/create-pull-request@v7 with a fixed branch name so
subsequent runs update the same PR rather than spawning new ones. The PR
body includes a review checklist covering the common review concerns.
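
A condensed sketch of the workflow; the cron, branch name, and how the ref reaches the script are assumptions:

```yaml
name: dashboard-drift-check
on:
  schedule:
    - cron: "0 6 * * 1"  # weekly; actual schedule assumed
  workflow_dispatch:
    inputs:
      ref:
        description: "Upstream infrahub ref to sync against"
        required: false
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/sync-dashboards.sh ${{ inputs.ref }}
      - uses: peter-evans/create-pull-request@v7
        with:
          branch: chore/dashboard-sync  # fixed branch: reruns update one PR
          draft: true
          title: "chore: re-sync vendored Grafana dashboards"
```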

Also updates docs/local-testing-observability.md to document the cAdvisor
scrape and the OrbStack-specific limitation discovered during testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>