feat: add infrahub-observability chart #62
Draft
FragmentedPacket wants to merge 21 commits into
Adds the skeleton for a new infrahub-observability Helm chart that deploys the same observability stack Infrahub ships for local Docker Compose development (Alloy + Loki + Tempo + Prometheus + Grafana + Prefect exporter) onto Kubernetes alongside the infrahub / infrahub-enterprise charts. Includes Chart.yaml with subchart dependencies, base values.yaml, shared helpers, vendored Grafana dashboards plus the sync script that keeps them in lockstep with upstream infrahub releases, README template, and Makefile targets for lint/template/sync. Templates that actually wire up the components land in follow-up commits per the plan in docs/plans/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
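For reference, the subchart wiring in Chart.yaml is shaped roughly like this (repository URLs are the usual upstream chart repos; versions and conditions here are placeholders, not the exact pins in the chart):

```yaml
# Sketch of Chart.yaml dependencies; version ranges are illustrative, not the committed pins.
apiVersion: v2
name: infrahub-observability
version: 0.1.0
dependencies:
  - name: alloy
    repository: https://grafana.github.io/helm-charts
    version: "~0.9"
    condition: alloy.enabled
  - name: loki
    repository: https://grafana.github.io/helm-charts
    version: "~6.0"
    condition: loki.enabled
  - name: tempo
    repository: https://grafana.github.io/helm-charts
    version: "~1.10"
    condition: tempo.enabled
  - name: prometheus
    repository: https://prometheus-community.github.io/helm-charts
    version: "~25.0"
    condition: prometheus.enabled
  - name: grafana
    repository: https://grafana.github.io/helm-charts
    version: "~8.0"
    condition: grafana.enabled
```

The Prefect exporter is not listed because no upstream chart exists for it; it ships as an in-chart Deployment instead.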
The Alloy subchart's auto-generated config doesn't know about our Loki/Prometheus/Tempo service names or our Kubernetes log-discovery needs, so we set its create=false and supply our own ConfigMap. Adapts the upstream opsmill/infrahub@infrahub-v1.9.3 Alloy config to Kubernetes: Docker discovery replaced with discovery.kubernetes role=pod, log-pipeline selectors switched from container-name patterns to the chart's component label, and scrape targets rewritten to use the chart-rendered Kubernetes Service names (templated via the helpers in _helpers.tpl). The Alloy subchart auto-resolves configMap.name to <release>-alloy when empty, so we update the alloyConfigMapName helper to mirror that logic exactly and leave the field unset. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
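In values.yaml terms the override amounts to roughly the following; the exact key nesting follows the grafana/alloy subchart's values layout and should be treated as a sketch:

```yaml
# Sketch only: assumes the dependency is named "alloy", so its values nest one level down.
alloy:
  alloy:
    configMap:
      create: false   # the chart renders its own config.alloy ConfigMap instead
      # name is deliberately left unset: the subchart falls back to "<release>-alloy",
      # and the alloyConfigMapName helper mirrors that same fallback
```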
Adapts opsmill/infrahub@infrahub-v1.9.3 development/grafana/provisioning/ datasources/datasource.yml to Kubernetes by rewriting Docker hostnames through the chart's URL helpers. UIDs (localprometheus / localloki / localtempo) are kept so the vendored dashboards continue to resolve their datasource references unchanged. Fixes the tempoUrl helper port — the Tempo single-binary subchart serves its HTTP query API on 3100, not 3200. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
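The rendered provisioning file ends up shaped roughly like this; the hostnames below are examples of what the URL helpers produce, with `<release>` standing in for the actual release name:

```yaml
# Illustrative datasource provisioning output; service names and ports are examples, UIDs are the real ones.
apiVersion: 1
datasources:
  - name: Prometheus
    uid: localprometheus
    type: prometheus
    url: http://<release>-prometheus-server:80
  - name: Loki
    uid: localloki
    type: loki
    url: http://<release>-loki:3100
  - name: Tempo
    uid: localtempo
    type: tempo
    url: http://<release>-tempo:3100   # single-binary chart serves the HTTP query API on 3100
```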
…r ConfigMaps Iterates Files.Glob over dashboards/*.json and emits one ConfigMap per dashboard rather than bundling them — the seven vendored dashboards already total ~900 KiB and loki_monitoring.json alone is ~390 KiB, so a single ConfigMap would risk hitting etcd's 1 MiB-per-object limit. Each ConfigMap carries the grafana_dashboard=1 label that Grafana's sidecar watches for. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
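A minimal sketch of the per-dashboard template; the fullname helper and naming scheme are illustrative, not necessarily the chart's exact implementation:

```yaml
# Sketch of the per-dashboard ConfigMap template.
{{- range $path, $_ := .Files.Glob "dashboards/*.json" }}
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ printf "%s-%s" (include "infrahub-observability.fullname" $) (base $path | trimSuffix ".json" | replace "_" "-") | trunc 63 | trimSuffix "-" }}
  labels:
    grafana_dashboard: "1"   # watched by Grafana's dashboard sidecar
data:
  {{ base $path }}: |-
{{ $.Files.Get $path | indent 4 }}
{{- end }}
```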
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…annotations Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ing tips Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in global.tracing block on the infrahub chart that emits INFRAHUB_TRACE_* env vars (matching upstream TraceSettings env_prefix) onto the server and task-worker Deployments. Defaults to disabled so existing users are unaffected; users of the new infrahub-observability chart can flip it on with one flag and point at the chart's Tempo endpoint. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
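From a user's values file, enabling it looks roughly like this; the sub-key names should be checked against the chart README, and the endpoint is only an example:

```yaml
# Sketch of the opt-in block; exact sub-keys may differ from the final chart values.
global:
  tracing:
    enabled: true                                # default is false, so existing installs are unaffected
    exporterEndpoint: http://my-obs-tempo:4317   # example OTLP gRPC endpoint of the observability chart's Tempo
```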
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Also regenerate infrahub README to document the new global.tracing block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cluster-agnostic walkthrough covering install of both charts, wiring the new global.tracing block, and verifying logs/metrics/traces all reach Grafana. Includes troubleshooting tips for the common failure modes (missing StorageClass, sidecar not picking up ConfigMaps, OTLP receiver). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… args

Found during end-to-end install testing against a real cluster:

* Prefect: the prefect-server subchart's Service is fixed-named "prefect-server" (no release prefix), not "<release>-task-manager-server" as previously assumed. Both the Alloy scrape and the prefectApiUrl helper pointed at a non-existent host, causing the Prefect exporter to CrashLoop and the task-manager scrape to silently fail.
* RabbitMQ: the Bitnami rabbitmq chart exposes prometheus metrics on port 9419 (named "metrics"), not the upstream rabbit-prom-plugin default of 15692. Alloy's message-queue scrape target was hitting a closed port.
* Neo4j: the Neo4j chart does not expose prometheus metrics by default, so port 2004 is never open. Commented the scrape out with a note on what to enable upstream to bring it back.
* Tempo metricsGenerator: shipping with enabled=true and remoteWriteUrl="" causes Tempo to crash on startup with "url for remote_write is empty". Default to disabled with a note on how to turn it on.
* Prometheus extraArgs: web.enable-lifecycle and storage.tsdb.retention.time are already set by the subchart defaults, so passing them via extraArgs produces duplicate-flag errors. Use server.retention for retention and reserve extraArgs for non-default flags only.
* Alloy log-pipeline relabel: infrahub subchart pods (cache, database, message-queue, postgres) carry only `infrahub/service:` labels, not the top-level `service:` label. Added a fallback so their logs still get a `component` label in Loki.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
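Two of those fixes translate directly into values defaults, roughly as below; the key paths follow the upstream subcharts' documented values and should be treated as a sketch:

```yaml
# Sketch of the corrected defaults described above; retention value is an example.
prometheus:
  server:
    retention: 15d          # use the chart's own knob; passing it via extraArgs duplicates the flag
    extraArgs: {}           # reserved for flags the subchart defaults don't already set
tempo:
  tempo:
    metricsGenerator:
      enabled: false        # enabling without a remoteWriteUrl crashes Tempo on startup
      remoteWriteUrl: ""    # e.g. the Prometheus remote-write endpoint when enabled
```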
…ACE_* Upstream infrahub's create_tracer_provider() (backend/infrahub/trace.py) constructs the OTLP gRPC exporter without forwarding the `insecure` setting it reads from INFRAHUB_TRACE_INSECURE. Without that, the OTel gRPC client defaults to TLS and fails the handshake against a plaintext OTLP collector (e.g. the Tempo shipped by infrahub-observability), with errors like "SSL_ERROR_SSL ... WRONG_VERSION_NUMBER". Workaround: also emit OTEL_EXPORTER_OTLP_INSECURE and OTEL_EXPORTER_OTLP_TRACES_INSECURE so the OTel SDK itself honours the setting even when infrahub's wrapper code forgets to pass it through. Verified end-to-end against a real Kubernetes cluster: with the workaround in place, traces from infrahub-server now reach Tempo and are queryable. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
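On the rendered server and task-worker pods the workaround amounts to emitting both families of variables; illustratively (the full INFRAHUB_TRACE_* set is omitted for brevity):

```yaml
# Illustrative env vars rendered when tracing is enabled.
env:
  - name: INFRAHUB_TRACE_INSECURE
    value: "true"
  - name: OTEL_EXPORTER_OTLP_INSECURE          # workaround: read by the OTel SDK itself
    value: "true"
  - name: OTEL_EXPORTER_OTLP_TRACES_INSECURE   # traces-specific variant, also honoured by the SDK
    value: "true"
```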
…y details After end-to-end install testing, update the doc with: * Actual pod-name list (postgres + prefect-server are present, rabbit pod is named -message-queue-0 not -rabbitmq-0) * Note that Prometheus has no scrape targets of its own (Alloy pushes via remote-write); verify via metric existence not /targets * New section "Service discovery and toggles" explaining how logs/metrics/ traces are discovered, the label conventions (`service:` vs `infrahub/service:`), and the toggle surface * Working trace-verification commands (the previous /api/storage/object curl returns 404 — replaced with GraphQL queries) * Troubleshooting note on the OTEL_EXPORTER_OTLP_INSECURE TLS gotcha Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the four-section design we brainstormed: scrape kubelet cAdvisor (with the nodes/proxy RBAC it needs), post-sync transform script that rewrites Docker-compose labels to K8s labels, and a static query allowlist validator wired into CI to catch future drift. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an Alloy scrape for the kubelet's /metrics/cadvisor endpoint (via the API server's nodes/proxy path), giving us the per-container CPU, memory, network, and filesystem metrics needed by the Container Resources and Neo4j Monitoring dashboards. Discovery uses role=node so the scrape covers every node in the cluster. TLS uses the in-cluster CA + serviceaccount token. The Alloy subchart's default ClusterRole already grants get on nodes/proxy — no new RBAC. Toggleable via alloy.cadvisor.enabled (default true). Disable if your cluster policy forbids nodes/proxy or if you have a separate cAdvisor scrape via kube-prometheus-stack. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
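The toggle sits alongside the other Alloy knobs in values.yaml, roughly:

```yaml
# Sketch of the toggle; the surrounding alloy values structure is illustrative.
alloy:
  cadvisor:
    enabled: true   # scrapes /api/v1/nodes/<node>/proxy/metrics/cadvisor through the API server;
                    # set to false if cluster policy forbids nodes/proxy or kube-prometheus-stack
                    # already provides a cAdvisor scrape
```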
…daptation
scripts/transform_dashboard.py rewrites vendored Grafana dashboards from
their docker-compose form to a Kubernetes-friendly form:
- container_label_com_docker_compose_service -> container
- container_label_com_docker_compose_project -> namespace
- id!="" (cgroup-root guard) -> container!="", image!="" (k8s equivalent)
- selector values: database -> neo4j, cache -> redis, message-queue ->
rabbitmq, task-manager-db -> postgresql
- legendFormat templates referencing the renamed labels
The transform is idempotent (re-running on already-transformed JSON is a
no-op) and byte-stable (it leaves the original bytes alone when no
content changes), which keeps PR diffs focused on real changes.
scripts/sync-dashboards.sh now pipes each fetched dashboard through the
transform before writing it to charts/infrahub-observability/dashboards/.
Upstream remains the only source of truth; we never edit the vendored
JSONs by hand.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
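To make the rewrites above concrete, they amount to something like the following mapping (shown as YAML purely for illustration; it is not the script's actual data structure):

```yaml
# Illustration of the rewrites performed by scripts/transform_dashboard.py.
label_renames:
  container_label_com_docker_compose_service: container
  container_label_com_docker_compose_project: namespace
selector_rewrites:
  'id!=""': 'container!="", image!=""'     # cgroup-root guard -> k8s equivalent
selector_value_renames:
  database: neo4j
  cache: redis
  message-queue: rabbitmq
  task-manager-db: postgresql
# legendFormat templates referencing the renamed labels are rewritten the same way.
```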
Regenerated via scripts/sync-dashboards.sh — picks up the new transform_dashboard.py pipeline. Only container_resources and neo4j_monitoring change because they're the dashboards that use Docker-specific labels. The other five files are byte-identical to upstream since they don't reference docker-compose container labels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
scripts/validate_dashboards.py walks every dashboard JSON, extracts the PromQL expressions, and checks two things:

* Hard fail on any token in `denied_tokens` (e.g. `container_label_com_docker_compose_service`). These are tokens that should never appear in a committed dashboard — if they do, the transform pipeline failed or upstream introduced a new pattern we don't yet handle.
* Soft warn on metrics that aren't matched by any `collected_prefixes` glob in `scripts/known-metrics.yaml`. Surfaces "this panel won't have data without an extra scrape" cases — currently 18 such warnings, all genuine (Neo4j metrics not collected by default, kube-state-metrics recording rules referenced by Prefect dashboards, etc.).

Runs in <1s without a cluster. Designed for CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
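scripts/known-metrics.yaml is therefore just two lists; a trimmed, illustrative shape (the entries below are examples, not the committed list):

```yaml
# Illustrative shape of scripts/known-metrics.yaml.
denied_tokens:
  - container_label_com_docker_compose_service
  - container_label_com_docker_compose_project
collected_prefixes:
  - container_*      # cAdvisor scrape
  - prometheus_*     # Prometheus self-metrics
  - loki_*
  - tempo_*
```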
Splits the monolithic helm-lint job into three chart-scoped jobs gated by dorny/paths-filter so a PR that only touches charts/infrahub/ no longer re-pulls the six observability subchart dependencies. Enterprise lint runs when either its own files or the infrahub chart it depends on change. The observability job now also runs scripts/validate_dashboards.py to catch the class of breakage we found during manual testing (vendored dashboards referencing labels we don't collect). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
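The gating step looks roughly like this; filter names and globs are illustrative:

```yaml
# Sketch of the path gating; job wiring and filter names are illustrative.
- uses: dorny/paths-filter@v3
  id: changes
  with:
    filters: |
      infrahub:
        - 'charts/infrahub/**'
      enterprise:
        - 'charts/infrahub-enterprise/**'
        - 'charts/infrahub/**'        # enterprise lint also reruns when its dependency changes
      observability:
        - 'charts/infrahub-observability/**'
```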
New workflow .github/workflows/dashboard-drift-check.yml runs weekly (and on workflow_dispatch with optional REF input). It re-runs sync-dashboards.sh and, if upstream has changed the JSON at the recorded ref (or if our transform pipeline drifted), opens a draft PR with the re-synced content for human review. Uses peter-evans/create-pull-request@v7 with a fixed branch name so subsequent runs update the same PR rather than spawning new ones. The PR body includes a review checklist covering the common review concerns. Also updates docs/local-testing-observability.md to document the cAdvisor scrape and the OrbStack-specific limitation discovered during testing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
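In outline, the drift-check workflow looks like this; the cron expression, branch name, and input name are illustrative:

```yaml
# Outline of dashboard-drift-check.yml; schedule, branch, and input names are illustrative.
on:
  schedule:
    - cron: "0 6 * * 1"          # weekly
  workflow_dispatch:
    inputs:
      ref:
        description: upstream infrahub ref to sync against
        required: false
jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/sync-dashboards.sh
      - uses: peter-evans/create-pull-request@v7
        with:
          branch: chore/dashboard-drift   # fixed branch so reruns update the same PR
          draft: true
          title: "chore: re-sync vendored Grafana dashboards"
```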
Summary
New infrahub-observability Helm chart bundling Alloy + Loki + Tempo + Prometheus + Grafana + the Prefect prometheus exporter for Kubernetes installs, designed to sit alongside the existing infrahub / infrahub-enterprise releases.

What's included
The chart
- Vendored Grafana dashboards synced from opsmill/infrahub@infrahub-v1.9.3 via scripts/sync-dashboards.sh
- Kubernetes-adapted config.alloy (logs via loki.source.kubernetes, metrics via static + node-discovery scrapes, traces forwarded by OTLP gRPC)
- A Deployment for prefecthq/prometheus-prefect-exporter (no upstream chart exists)
- A scrape of the kubelet's /metrics/cadvisor via the API-server node-proxy for per-container CPU/memory/network/fs metrics (toggleable via alloy.cadvisor.enabled)
- NOTES.txt with the wiring snippet for tracing

Cross-chart
- Opt-in global.tracing block added to the infrahub chart so users can wire OTLP env vars onto server + task-worker with a single flag
- Emits the OTEL_EXPORTER_OTLP_INSECURE env var alongside INFRAHUB_TRACE_* because upstream infrahub's gRPC exporter init doesn't forward the insecure setting to the OTel SDK (TLS handshake fails against plaintext collectors otherwise)

Dashboard automation pipeline
- scripts/transform_dashboard.py rewrites docker-compose labels to Kubernetes labels (container_label_com_docker_compose_service → container, etc.) on every sync. Idempotent and byte-stable.
- scripts/validate_dashboards.py static-checks every dashboard's PromQL against an allowlist (scripts/known-metrics.yaml). Hard-fails on known-broken tokens, soft-warns on metrics we don't collect.
- .github/workflows/dashboard-drift-check.yml runs the sync weekly and auto-opens a draft PR if upstream drifted.

CI
- helm-lint split into chart-scoped jobs gated by dorny/paths-filter so PRs to one chart don't re-pull every chart's dependencies

Test plan
- make lint passes locally for all three charts
- helm template test charts/infrahub-observability renders cleanly
- End-to-end install against a real cluster: traces from infrahub-server queryable in Tempo
- up{job="cadvisor"} returns 1 (data not visible on OrbStack specifically; works on other distros; documented in docs/local-testing-observability.md)

Followups
- Upstream the trace-exporter insecure pass-through fix to the opsmill/infrahub repo so the OTEL_EXPORTER_OTLP_INSECURE workaround can eventually be dropped