
feat: add infrahub-observability chart #62

Draft
FragmentedPacket wants to merge 21 commits into stable from feat/infrahub-observability-chart

Conversation

@FragmentedPacket
Contributor

Summary

A new infrahub-observability Helm chart that bundles Alloy + Loki + Tempo + Prometheus + Grafana + the Prefect prometheus exporter for Kubernetes installs, designed to sit alongside the existing infrahub / infrahub-enterprise releases.

What's included

The chart

  • Vendors seven Grafana dashboards from opsmill/infrahub@infrahub-v1.9.3 via scripts/sync-dashboards.sh
  • Ships our own Alloy config.alloy (logs via loki.source.kubernetes, metrics via static + node-discovery scrapes, traces forwarded by OTLP gRPC)
  • Provisions Grafana datasources and dashboards via sidecar-watched ConfigMaps
  • Ships an in-chart Deployment + Service for prefecthq/prometheus-prefect-exporter (no upstream chart exists)
  • Scrapes kubelet /metrics/cadvisor via the API-server node-proxy for per-container CPU/memory/network/fs metrics (toggleable via alloy.cadvisor.enabled)
  • Post-install NOTES.txt with the wiring snippet for tracing

Cross-chart

  • Adds an opt-in global.tracing block to the infrahub chart so users can wire OTLP env vars onto server + task-worker with a single flag (values sketch below)
  • Emits the standard OTEL_EXPORTER_OTLP_INSECURE env var alongside INFRAHUB_TRACE_* because upstream infrahub's gRPC exporter init doesn't forward the insecure setting to the OTel SDK (otherwise the TLS handshake fails against plaintext collectors)
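
A minimal values sketch of what enabling this looks like. The key names under global.tracing are assumptions for illustration; only the env-var behaviour is confirmed by this PR.

```yaml
# Hypothetical shape of the opt-in block; key names are illustrative,
# not the chart's final schema.
global:
  tracing:
    enabled: true
    # OTLP gRPC endpoint of the collector, e.g. the Tempo shipped by
    # infrahub-observability (service name assumed)
    endpoint: "infrahub-observability-tempo:4317"
    insecure: true  # plaintext collector; also emits the OTEL_* vars
```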

Dashboard automation pipeline

  • scripts/transform_dashboard.py rewrites docker-compose labels to Kubernetes labels (container_label_com_docker_compose_service → container, etc.) on every sync. Idempotent and byte-stable.
  • scripts/validate_dashboards.py static-checks every dashboard's PromQL against an allowlist (scripts/known-metrics.yaml). Hard-fails on known-broken tokens, soft-warns on metrics we don't collect.
  • .github/workflows/dashboard-drift-check.yml runs the sync weekly and auto-opens a draft PR if upstream drifted.

CI

  • Three chart-scoped helm-lint jobs gated by dorny/paths-filter so PRs to one chart don't re-pull every chart's dependencies
  • New observability job also runs the dashboard validator

Test plan

  • make lint passes locally for all three charts
  • helm template test charts/infrahub-observability renders cleanly
  • All dashboard ConfigMaps stay under 1 MiB (largest is loki_monitoring at ~455 KB)
  • Manual install into a local Kubernetes cluster: Grafana reachable, all three datasources healthy, all seven dashboards visible, logs flowing into Loki (7 components), metrics flowing into Prometheus (8 scrape jobs reporting), traces from infrahub-server queryable in Tempo
  • cAdvisor scrape verified healthy via Alloy debug UI; up{job="cadvisor"} returns 1 (data not visible on OrbStack specifically; works on other distros — documented in docs/local-testing-observability.md)
  • Validator catches injected denied tokens (hard fail) and surfaces 18 meaningful soft warnings on the current dashboards
  • Transform script is idempotent (verified by repeated sync runs)

Followups

  • End-user installation docs for the observability stack in the opsmill/infrahub repo
  • Optional: kube-state-metrics integration for richer Prefect dashboard panels (currently warned-on)
  • Optional: enable Neo4j prometheus scrape (the chart's commented-out scrape block has the recipe)

FragmentedPacket and others added 21 commits May 12, 2026 11:54
Adds the skeleton for a new infrahub-observability Helm chart that
deploys the same observability stack Infrahub ships for local Docker
Compose development (Alloy + Loki + Tempo + Prometheus + Grafana +
Prefect exporter) onto Kubernetes alongside the infrahub /
infrahub-enterprise charts.

Includes Chart.yaml with subchart dependencies, base values.yaml,
shared helpers, vendored Grafana dashboards plus the sync script that
keeps them in lockstep with upstream infrahub releases, README
template, and Makefile targets for lint/template/sync. Templates that
actually wire up the components land in follow-up commits per the
plan in docs/plans/.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Alloy subchart's auto-generated config doesn't know about our
Loki/Prometheus/Tempo service names or our Kubernetes log-discovery
needs, so we set its create=false and supply our own ConfigMap.

Adapts the upstream opsmill/infrahub@infrahub-v1.9.3 Alloy config to
Kubernetes: Docker discovery replaced with discovery.kubernetes role=pod,
log-pipeline selectors switched from container-name patterns to the
chart's component label, and scrape targets rewritten to use the
chart-rendered Kubernetes Service names (templated via the helpers in
_helpers.tpl).
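
For reference, a sketch of the values override this implies. Key paths follow the grafana/alloy subchart's layout as I understand it; verify against the pinned subchart version.

```yaml
# Parent-chart values: turn off the subchart's generated config so it
# mounts the ConfigMap this chart renders instead.
alloy:
  alloy:
    configMap:
      create: false
      # name intentionally left unset; the subchart then falls back to
      # <release>-alloy (see the helper note below)
```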

The Alloy subchart auto-resolves configMap.name to <release>-alloy when
empty, so we update the alloyConfigMapName helper to mirror that logic
exactly and leave the field unset.
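
A sketch of what that helper reduces to (helper name and values path assumed):

```yaml
{{/* Mirror the Alloy subchart's fallback: <release>-alloy when
     configMap.name is empty. */}}
{{- define "infrahub-observability.alloyConfigMapName" -}}
{{- default (printf "%s-alloy" .Release.Name) .Values.alloy.alloy.configMap.name -}}
{{- end -}}
```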

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adapts opsmill/infrahub@infrahub-v1.9.3 development/grafana/provisioning/
datasources/datasource.yml to Kubernetes by rewriting Docker hostnames
through the chart's URL helpers. UIDs (localprometheus / localloki /
localtempo) are kept so the vendored dashboards continue to resolve their
datasource references unchanged.

Fixes the tempoUrl helper port — the Tempo single-binary subchart serves
its HTTP query API on 3100, not 3200.
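
Each provisioned entry ends up shaped roughly like this (standard Grafana datasource-provisioning format; the URL helper name is assumed):

```yaml
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: localtempo  # kept from upstream so dashboard refs still resolve
    # rendered by the chart's tempoUrl helper, pointing at the
    # single-binary subchart's HTTP port 3100
    url: {{ include "infrahub-observability.tempoUrl" . }}
```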

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r ConfigMaps

Iterates Files.Glob over dashboards/*.json and emits one ConfigMap per
dashboard rather than bundling them — the seven vendored dashboards
already total ~900 KiB and loki_monitoring.json alone is ~390 KiB, so a
single ConfigMap would risk hitting etcd's 1 MiB-per-object limit. Each
ConfigMap carries the grafana_dashboard=1 label that Grafana's sidecar
watches for.
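
The template loop is roughly the following (fullname helper assumed; name truncation/validation elided):

```yaml
{{- range $path, $bytes := .Files.Glob "dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  # one object per dashboard keeps each well under etcd's 1 MiB cap
  name: {{ printf "%s-%s" (include "infrahub-observability.fullname" $) (base $path | trimSuffix ".json" | replace "_" "-") }}
  labels:
    grafana_dashboard: "1"  # the label Grafana's sidecar watches for
data:
  {{ base $path }}: |-
{{ $bytes | toString | indent 4 }}
{{- end }}
```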

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…annotations

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing tips

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in global.tracing block on the infrahub chart that emits
INFRAHUB_TRACE_* env vars (matching upstream TraceSettings env_prefix) onto
the server and task-worker Deployments. Defaults to disabled so existing
users are unaffected; users of the new infrahub-observability chart can flip
it on with one flag and point at the chart's Tempo endpoint.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also regenerate infrahub README to document the new global.tracing block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster-agnostic walkthrough covering install of both charts, wiring the
new global.tracing block, and verifying logs/metrics/traces all reach
Grafana. Includes troubleshooting tips for the common failure modes
(missing StorageClass, sidecar not picking up ConfigMaps, OTLP receiver).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… args

Found during end-to-end install testing against a real cluster:

* Prefect: the prefect-server subchart's Service has the fixed name
  "prefect-server" (no release prefix), not "<release>-task-manager-server"
  as previously assumed. Both the Alloy scrape and the prefectApiUrl
  helper pointed at a non-existent host, causing the Prefect exporter to
  CrashLoop and the task-manager scrape to fail silently.
* RabbitMQ: the Bitnami rabbitmq chart exposes prometheus metrics on port
  9419 (named "metrics"), not the upstream rabbit-prom-plugin default of
  15692. Alloy's message-queue scrape target was hitting a closed port.
* Neo4j: the Neo4j chart does not expose prometheus metrics by default,
  so port 2004 is never open. Commented the scrape out with a note on
  what to enable upstream to bring it back.
* Tempo metricsGenerator: shipping with enabled=true and remoteWriteUrl=""
  causes Tempo to crash on startup with "url for remote_write is empty".
  Default to disabled with a note on how to turn it on.
* Prometheus extraArgs: web.enable-lifecycle and storage.tsdb.retention.time
  are already set by the subchart defaults, so passing them via extraArgs
  produces duplicate-flag errors. Use server.retention for retention and
  reserve extraArgs for non-default flags only.
* Alloy log-pipeline relabel: infrahub subchart pods (cache, database,
  message-queue, postgres) carry only `infrahub/service:` labels, not the
  top-level `service:` label. Added a fallback so their logs still get a
  `component` label in Loki.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ACE_*

Upstream infrahub's create_tracer_provider() (backend/infrahub/trace.py)
constructs the OTLP gRPC exporter without forwarding the `insecure`
setting it reads from INFRAHUB_TRACE_INSECURE. Without that, the OTel
gRPC client defaults to TLS and fails the handshake against a plaintext
OTLP collector (e.g. the Tempo shipped by infrahub-observability), with
errors like "SSL_ERROR_SSL ... WRONG_VERSION_NUMBER".

Workaround: also emit OTEL_EXPORTER_OTLP_INSECURE and
OTEL_EXPORTER_OTLP_TRACES_INSECURE so the OTel SDK itself honours the
setting even when infrahub's wrapper code forgets to pass it through.
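
Concretely, the rendered env block looks something like this (a sketch; the insecure-related names are the ones confirmed here, the rest of the tracing env surface is elided):

```yaml
env:
  - name: INFRAHUB_TRACE_INSECURE
    value: "true"
  # SDK-level workaround: make the OTel gRPC exporter itself use
  # plaintext, since infrahub's wrapper drops the insecure flag
  - name: OTEL_EXPORTER_OTLP_INSECURE
    value: "true"
  - name: OTEL_EXPORTER_OTLP_TRACES_INSECURE
    value: "true"
```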

Verified end-to-end against a real Kubernetes cluster: with the workaround
in place, traces from infrahub-server now reach Tempo and are queryable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…y details

After end-to-end install testing, update the doc with:
* Actual pod-name list (postgres + prefect-server are present, rabbit pod
  is named -message-queue-0 not -rabbitmq-0)
* Note that Prometheus has no scrape targets of its own (Alloy pushes via
  remote-write); verify via metric existence not /targets
* New section "Service discovery and toggles" explaining how logs/metrics/
  traces are discovered, the label conventions (`service:` vs
  `infrahub/service:`), and the toggle surface
* Working trace-verification commands (the previous /api/storage/object
  curl returns 404 — replaced with GraphQL queries)
* Troubleshooting note on the OTEL_EXPORTER_OTLP_INSECURE TLS gotcha

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the four-section design we brainstormed: scrape kubelet cAdvisor
(with the nodes/proxy RBAC it needs), post-sync transform script that
rewrites Docker-compose labels to K8s labels, and a static query
allowlist validator wired into CI to catch future drift.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an Alloy scrape for the kubelet's /metrics/cadvisor endpoint (via the
API server's nodes/proxy path), giving us the per-container CPU, memory,
network, and filesystem metrics needed by the Container Resources and
Neo4j Monitoring dashboards.

Discovery uses role=node so the scrape covers every node in the cluster.
TLS uses the in-cluster CA + serviceaccount token. The Alloy subchart's
default ClusterRole already grants get on nodes/proxy — no new RBAC.

Toggleable via alloy.cadvisor.enabled (default true). Disable if your
cluster policy forbids nodes/proxy or if you have a separate cAdvisor
scrape via kube-prometheus-stack.
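
The toggle surface in values form (default per this commit; the comments summarize the mechanics described above):

```yaml
alloy:
  cadvisor:
    enabled: true  # default; set false if policy forbids nodes/proxy
    # When enabled, Alloy discovers nodes (role=node) and scrapes
    # /api/v1/nodes/<node>/proxy/metrics/cadvisor through the API
    # server, using the in-cluster CA and serviceaccount token.
```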

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…daptation

scripts/transform_dashboard.py rewrites vendored Grafana dashboards from
their docker-compose form to a Kubernetes-friendly form:
  - container_label_com_docker_compose_service -> container
  - container_label_com_docker_compose_project -> namespace
  - id!="" (cgroup-root guard) -> container!="", image!="" (k8s equivalent)
  - selector values: database -> neo4j, cache -> redis, message-queue ->
    rabbitmq, task-manager-db -> postgresql
  - legendFormat templates referencing the renamed labels

The transform is idempotent (re-running on already-transformed JSON is a
no-op) and byte-stable (it leaves the original bytes alone when no
content changes), which keeps PR diffs focused on real changes.

scripts/sync-dashboards.sh now pipes each fetched dashboard through the
transform before writing it to charts/infrahub-observability/dashboards/.
Upstream remains the only source of truth; we never edit the vendored
JSONs by hand.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerated via scripts/sync-dashboards.sh — picks up the new
transform_dashboard.py pipeline. Only container_resources and
neo4j_monitoring change because they're the dashboards that use Docker-
specific labels. The other five files are byte-identical to upstream
since they don't reference docker-compose container labels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/validate_dashboards.py walks every dashboard JSON, extracts the
PromQL expressions, and checks two things:

* Hard fail on any token in `denied_tokens` (e.g.
  `container_label_com_docker_compose_service`). These are tokens that
  should never appear in a committed dashboard — if they do, the
  transform pipeline failed or upstream introduced a new pattern we
  don't yet handle.

* Soft warn on metrics that aren't matched by any `collected_prefixes`
  glob in `scripts/known-metrics.yaml`. Surfaces "this panel won't have
  data without an extra scrape" cases — currently 18 such warnings,
  all genuine (Neo4j metrics not collected by default, kube-state-metrics
  recording rules referenced by Prefect dashboards, etc.).

Runs in <1s without a cluster. Designed for CI.
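
A sketch of the allowlist file's shape; the two top-level keys come from this commit, the entries are illustrative:

```yaml
# scripts/known-metrics.yaml (entries illustrative)
denied_tokens:
  # tokens that must never appear in a committed dashboard
  - container_label_com_docker_compose_service
  - container_label_com_docker_compose_project
collected_prefixes:
  # globs matching metrics we actually collect; anything else soft-warns
  - container_*   # kubelet cAdvisor
  - loki_*
  - tempo_*
  - prometheus_*
  - prefect_*
```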

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits the monolithic helm-lint job into three chart-scoped jobs gated by
dorny/paths-filter so a PR that only touches charts/infrahub/ no longer
re-pulls the six observability subchart dependencies. Enterprise lint
runs when either its own files or the infrahub chart it depends on
change.

The observability job now also runs scripts/validate_dashboards.py to
catch the class of breakage we found during manual testing (vendored
dashboards referencing labels we don't collect).
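
In outline (job and filter names assumed; enterprise lint keys off both its own paths and the infrahub chart's):

```yaml
jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      observability: ${{ steps.filter.outputs.observability }}
    steps:
      - uses: actions/checkout@v4
      - uses: dorny/paths-filter@v3
        id: filter
        with:
          filters: |
            observability:
              - 'charts/infrahub-observability/**'
  lint-observability:
    needs: changes
    if: needs.changes.outputs.observability == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: helm lint charts/infrahub-observability
      - run: python scripts/validate_dashboards.py  # new validator step
```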

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New workflow .github/workflows/dashboard-drift-check.yml runs weekly (and
on workflow_dispatch with optional REF input). It re-runs sync-dashboards.sh
and, if upstream has changed the JSON at the recorded ref (or if our
transform pipeline drifted), opens a draft PR with the re-synced content
for human review.

Uses peter-evans/create-pull-request@v7 with a fixed branch name so
subsequent runs update the same PR rather than spawning new ones. The PR
body includes a review checklist covering the common review concerns.
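
A condensed sketch of the workflow; the cron, branch name, and how the ref reaches the script are assumptions:

```yaml
name: dashboard-drift-check
on:
  schedule:
    - cron: "0 6 * * 1"  # weekly; actual schedule assumed
  workflow_dispatch:
    inputs:
      ref:
        description: "Upstream infrahub ref to sync against"
        required: false
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/sync-dashboards.sh ${{ inputs.ref }}
      - uses: peter-evans/create-pull-request@v7
        with:
          branch: chore/dashboard-sync  # fixed branch: reruns update one PR
          draft: true
          title: "chore: re-sync vendored Grafana dashboards"
```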

Also updates docs/local-testing-observability.md to document the cAdvisor
scrape and the OrbStack-specific limitation discovered during testing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>