Skip to content

[codex] Document Grafana and OpenTelemetry on Control Plane#352

Merged
justin808 merged 30 commits into
mainfrom
docs/grafana-opentelemetry
Jul 1, 2026
Merged

[codex] Document Grafana and OpenTelemetry on Control Plane#352
justin808 merged 30 commits into
mainfrom
docs/grafana-opentelemetry

Conversation

@justin808

@justin808 justin808 commented Jun 2, 2026

Copy link
Copy Markdown
Member

Summary

Adds a generic Grafana/OpenTelemetry guide for Control Plane Flow.

The guide covers:

  • where OpenTelemetry complements Control Plane Grafana platform metrics
  • a recommended internal collector architecture
  • Rails app opt-in setup behind ENABLE_OPEN_TELEMETRY
  • Control Plane resource attributes to whitelist onto telemetry
  • collector workload/config shape
  • dashboard and alert starting points
  • rollout and validation checklists
  • safety notes for dashboards, alerts, production rollout, and secret handling

Also adds navigation links from README.md and docs/tips.md.

Safety

This is docs-only guidance. It does not apply templates, change Control Plane settings, save Grafana dashboards, edit alert rules, or touch any runtime environment.

The content is intentionally generic and avoids tenant-specific org names, app names, dashboard URLs, or secret values.

Validation

Ran locally:

script/check_cpln_links README.md docs/tips.md docs/grafana-opentelemetry.md

Result: exit 0.

git diff --check

Result: exit 0.

Also ran a local text scan for secret-looking terms and tenant-specific names. Matches were generic placeholders or existing docs language, not literal secret values or app/org-specific settings.

Notes

The independent auto-review helper was not run because local policy blocked exporting repository diff/code to an external AI review service. Since this is docs-only, I used local inspection plus the link and whitespace checks above.


Note

Low Risk
Documentation-only changes with no runtime, CLI, or infrastructure behavior changes.

Overview
Adds docs/grafana-opentelemetry.md, a generic playbook for app-level observability on Control Plane: how OpenTelemetry complements built-in Grafana platform metrics, an internal collector → Prometheus metrics → Grafana scrape flow, Rails opt-in behind ENABLE_OPEN_TELEMETRY, Control Plane resource attributes, collector workload/config patterns (including spanmetrics and cardinality pitfalls), plus rollout, validation, dashboard/alert starters, and safety notes.

README.md and docs/tips.md link to the new guide (Tips gets a short intro section and renumbered table of contents). docs/commands.md only updates the update-github-actions example release tag from v5.1.0 to v5.1.1.

Reviewed by Cursor Bugbot for commit 4596591. Bugbot is set up for automated code reviews on this repo. Configure here.

@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@justin808, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 10 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9129bfe6-2a2f-4b22-9c5d-c9614fcff3fd

📥 Commits

Reviewing files that changed from the base of the PR and between 00b098a and a56caa3.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • README.md
  • docs/grafana-opentelemetry.md
  • docs/tips.md
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/grafana-opentelemetry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@claude

claude Bot commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Code Review: docs/grafana-opentelemetry.md

This is a solid, well-structured observability guide with good safety guardrails and a sensible rollout order. The content is actionable and appropriately scoped. A few technical accuracy issues in the code examples are worth fixing before merge — they could cause silent failures or confusion for readers who copy the snippets.

Correctness issues

1. Gem list vs. config.use mismatch (high impact)
The Gemfile section includes opentelemetry-instrumentation-active_support, opentelemetry-instrumentation-faraday, and opentelemetry-instrumentation-http, but none of these three appear in the config.use block below. A reader who copies both sections verbatim installs three gems but never activates them. Either add the config.use calls or remove the gems from the Gemfile example.

2. Generic service_name default (medium impact)
ENV.fetch("OTEL_SERVICE_NAME", "rails") defaults every Rails service to the name "rails". In a multi-workload GVC, all services would share the same service.name, making dashboards unfilterable by service. The fallback should be something like ENV.fetch("CPLN_WORKLOAD_NAME", "rails-app") or at minimum the comment should call out that this default must be overridden per workload.

3. 0ms histogram bucket (low impact)
A 0ms lower boundary in an explicit histogram is semantically meaningless — all latency observations are ≥ 0ms by definition, so this bucket will always be empty. Some OTel Collector versions also reject zero-value duration boundaries. Start from 5ms or 10ms.

4. Placeholder script paths in Validation section (low impact)
spec/path/to/open_telemetry_specs.rb, ./script/check_open_telemetry_config, and ./script/validate_open_telemetry_config are generic placeholders that do not exist in this repository. Readers may run them and be confused by the command not found errors. A note that these are illustrative names to be replaced — or using # example comments — would help.

Minor suggestions

  • config.use_all: The initializer snippet activates each instrumentation with individual config.use calls. This works, but the SDK's config.use_all auto-loads every installed instrumentation gem and is the more idiomatic shorthand. Even a brief mention would help readers who are not yet familiar with the API.

  • OTEL_SERVICE_NAME env var overlap: OTEL_SERVICE_NAME is an official OpenTelemetry spec environment variable that the Ruby SDK picks up automatically. Wrapping it in ENV.fetch and passing it to config.service_name = works, but readers may not realize the SDK already handles this env var natively. A short note avoids confusion.

What's good

  • Architecture diagram is clear and shows the internal-only collector correctly.
  • The template-promotion rule ("wait until two apps need the same shape") is practical and guards against premature abstraction.
  • Dashboard and alert checklists are genuinely useful and not obvious.
  • The cardinality safety notes are important and well placed.
  • Rollout order is cautious: non-production first, then explicit production approval.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
@justin808 justin808 requested review from borela, dzirtusss and frullah June 2, 2026 03:46
justin808 added 2 commits June 3, 2026 16:00
…metry

* origin/main:
  Bump version to 5.1.1
  Update CHANGELOG.md for 5.1.1 (#362)
  Replace maintenance mode sleep with domain polling (#337)
  Handle completed runner jobs before replicas appear
  Warn when staging image lacks commit suffix
  Address production promotion review follow-ups
  Harden production promotion image copy and readiness
  Add Control Plane Flow logo and brand icon assets (#358)
  Bump version to 5.1.0
  Update CHANGELOG.md for 5.1.0 (#357)
  Regenerate docs/commands.md for cpflow 5.0.4 (#344)
  Add shared secret grants (#354)
  Harden production promotion image copy (#355)
  Fix production promotion environment secret access (#353)
  Clarify production environment secret setup (#351)
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Overview: Documentation-only PR adding a Grafana/OpenTelemetry guide for Control Plane Rails apps. The structure and safety notes are solid. A few technical points in the Ruby example and supporting content are worth addressing before merging.


Technical issues

1. OTLP exporter endpoint construction

The code builds the endpoint as "#{collector}/v1/traces" where collector defaults to http://localhost:4318. This works in recent gem versions, but it couples the custom OPEN_TELEMETRY_COLLECTOR_ADDRESS env var to a manually-appended path. The standard approach is to pass the base URL only and let the gem append /v1/traces automatically, or rely on the standard OTEL_EXPORTER_OTLP_ENDPOINT env var instead of a custom one. If a reader sets OPEN_TELEMETRY_COLLECTOR_ADDRESS to a URL that already includes the path, they will hit a double-path error that is hard to diagnose.

2. Gemfile group omitted

The Gemfile example shows all OTel gems in the default group. This loads them unconditionally in test and development, adding boot overhead and requiring the gems to be present in all environments. A note recommending the production group (or conditional loading matching the ENABLE_OPEN_TELEMETRY guard) would reflect how teams typically handle this.

3. original_cpln_* attribute names unexplained

The resource attribute list uses an original_ prefix (original_cpln_org, original_cpln_gvc, etc.) without explaining where this convention comes from or how readers should derive these values from the CPLN env vars that workloads actually expose (CPLN_ORG, CPLN_GVC, etc.). One sentence connecting the env var names to the attribute names would prevent confusion.

4. OPEN_TELEMETRY_CONFIG_HASH purpose unexplained

The collector env var block lists OPEN_TELEMETRY_CONFIG_HASH: "<hash-of-config>" without explaining its purpose. This appears to be the Control Plane pattern for forcing a workload restart when a secret referenced by cpln://secret/ changes (since a secret content change alone does not trigger a restart). Without that context, readers will not know whether this is optional or why they need it.

5. config.use_all note missing required gem

The doc mentions config.use_all as an alternative to the explicit config.use list but does not mention that this requires the opentelemetry-instrumentation-all gem. A reader who removes the explicit list and calls use_all on the stock Gemfile from the example will get a load error.


Minor points

  • The Ruby initializer block does not mention the file it belongs in (e.g. config/initializers/open_telemetry.rb). A one-line comment or filename note would help readers know where to place the snippet.
  • The collector script names in the suggested .controlplane/open_telemetry/ layout (build_main_collector_config, check_main_collector_config, validate_main_collector_config) have no file extension. A note that these are shell scripts would reduce ambiguity.
  • The docs/commands.md version bump from v5.1.0 to v5.1.1 in an example is unrelated to the PR's stated purpose and would be cleaner in a separate commit.

What works well

  • Staged rollout order (non-production first, explicit production approval) is exactly right.
  • Safety notes on cardinality, secrets, and dashboard JSON are thorough and appropriately cautious.
  • Template guidance (wait for two apps before extracting shared templates) is a sensible default.
  • Keeping the guide free of org/app-specific values is the correct call for a shared docs repo.
  • Dashboard and alert checklists are practical and actionable.

Summary: The content is well-organized and covers the right topics. Addressing the endpoint construction note, Gemfile group guidance, original_cpln_* naming explanation, and config hash purpose would make this guide safe to follow without surprises.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review: Document Grafana and OpenTelemetry on Control Plane

Overall: Well-structured, docs-only PR. The coverage is thorough — architecture, Rails setup, collector config, rollout checklist, dashboards, alerts, and safety notes all follow sensible patterns. A few technical details are worth a second look before merging.


1. Gemfile group limits dev/test use

Placing OpenTelemetry gems inside group :production means they are unavailable in development and test environments. This prevents running the app locally with ENABLE_OPEN_TELEMETRY=true and any RSpec specs that exercise the OTel initializer.

If the intent is prod-only for cost/noise reasons, a note explaining the trade-off would help. Many teams instead use the default group (or group :default, :production) with require: false so the initializer can be exercised in all environments.

2. Port 9292 conflicts with Puma/Rack default

Port 9292 is Rack/Puma's conventional development default, so recommending it for the collector's Prometheus endpoint will cause confusion in Rails projects. The OTel Prometheus exporter defaults to 8889; other common choices are 9464 or any non-conflicting 9xxx port. Consider a different port or add a note that 9292 is only an example and should not collide with existing workloads.

3. OTTL syntax in transform/normalize needs verification

The transform snippet uses span.attributes["key"] and scope.name. In standard OTTL, inside trace_statements a context: span block is usually required, and within that context attributes are written as attributes["key"], not span.attributes["key"]. The instrumentation scope accessor is instrumentation_scope.name, not scope.name. A mismatch here silently produces no output without a startup error.


Minor Observations

  • The exclude_dimensions field in the spanmetrics connector is worth double-checking — some collector releases expose it, others do not.
  • set(span.attributes["resource.name"], span.name) sets resource.name to the span name. For OTel resource-level attributes the OTTL accessor is resource.attributes["key"].
  • Rollout step 1 says "behind ENABLE_OPEN_TELEMETRY=false" but the initializer checks == "true", so the var being absent already disables it. Wording like "without setting ENABLE_OPEN_TELEMETRY=true" is less ambiguous.

What Looks Good

  • Architecture overview and signal taxonomy (Metrics/Traces/Logs) are clear and accurate.
  • The ENABLE_OPEN_TELEMETRY guard and two-level ENV.fetch endpoint fallback are good patterns.
  • Resource attribute naming conventions (original_cpln_*) and the single-helper-module guidance are solid.
  • Rollout sequencing (collector first, app second, dashboard before alerts) is operationally sound.
  • Safety notes — no secrets in dashboard JSON and low-cardinality label discipline — are the right things to emphasize.
  • The docs/tips.md insertion and README link are consistent with the existing document index.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

This is a well-structured, documentation-only PR. The guide is thorough and the rollout/safety checklists are practical. A couple of technical issues in the Ruby code examples worth addressing before merging.

Bug: require: false + config.use mismatch

The Gemfile block marks every instrumentation gem with require: false, but the initializer only explicitly requires opentelemetry/sdk and opentelemetry/exporter/otlp. The subsequent config.use "OpenTelemetry::Instrumentation::Rails" (and PG, Redis, etc.) calls rely on Object.const_get under the hood — if those constants are not loaded, the configure block raises a NameError at startup.

Two ways to fix:

Option A — add explicit requires in the initializer (keeps require: false in Gemfile):

require "opentelemetry/sdk"
require "opentelemetry/exporter/otlp"
require "opentelemetry/instrumentation/rails"
require "opentelemetry/instrumentation/pg"
require "opentelemetry/instrumentation/redis"
require "opentelemetry/instrumentation/sidekiq"
require "opentelemetry/instrumentation/faraday"
require "opentelemetry/instrumentation/http"

Option B — drop require: false from the instrumentation gems (Bundler auto-loads them).

Option A is more explicit and pairs well with the env-gated initializer pattern shown in the guide.


Minor: original_ attribute prefix is non-standard and unexplained

The guide recommends original_cpln_org, original_cpln_gvc, etc., but doesn't explain why original_ was chosen. The prefix reads as though the values are copies of something else. Alternatives worth considering:

  • Dot-namespace: cpln.org, cpln.gvc, cpln.workload — consistent with OTel semantic conventions (e.g., cloud.provider, k8s.pod.name).
  • Flat: cpln_org, cpln_gvc — simpler, no confusing prefix.

If original_ is intentional (e.g., to avoid collision with a future upstream OTel CPLN convention), a brief explanatory note would help readers understand the choice.


Nits

  • The Validation section references placeholder paths (spec/open_telemetry/, ./script/check_open_telemetry_config). The inline comments say "Replace with your app's actual..." — a callout block (GitHub > [!NOTE]) would make these easier to scan.
  • Worth mentioning config.use_all as a one-liner alternative to the individual config.use calls — helpful for apps that already have all instrumentation gems in their Gemfile.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Code Review

Well-structured guide overall — good rollout sequencing, checklists, and safety notes. A few technical issues worth addressing before merging.

Issues

1. OTLP Exporter endpoint mismatch (lines 117–126)
The code manually fetches OTEL_EXPORTER_OTLP_ENDPOINT then passes it as the endpoint: constructor arg to Exporter.new. These two paths behave differently in the Ruby OTLP exporter gem:

  • The OTEL_EXPORTER_OTLP_ENDPOINT env var is a base URL — the SDK automatically appends /v1/traces.
  • The endpoint: constructor param is treated as a full signal-specific URL in most gem versions, so the SDK does not append the path.

The result: if the env var is set to http://otel-collector:4318, the exporter will POST to http://otel-collector:4318 instead of http://otel-collector:4318/v1/traces — a 404 at startup. Simplest fix: remove the manual fetch and use Exporter.new with no endpoint: argument so the SDK reads the env var correctly, or explicitly append /v1/traces.

2. resource.name is a Datadog convention, not a standard OTel attribute (line 252)
The OTTL statement set(attributes["resource.name"], name) copies the span name into resource.name. This attribute name originates in Datadog APM and is not part of the OpenTelemetry semantic conventions. In a Prometheus/Grafana stack it won't be interpreted specially. Either document it as a custom/Datadog-derived attribute, or replace it with a standard attribute like span.name (already available) or a clearly namespaced custom key.

3. spanmetrics connector doesn't filter to root spans (lines 258–281)
The connector is named http_root_span_latency and the heading reads "Generate a root request latency metric", but there is no filter in the config — it will include all spans, not just root or HTTP server spans. Either add a filter (the transform processor already sets root_span=true and can be used as a condition on the pipeline that feeds this connector) or adjust the description to match the actual behavior.

4. original_cpln_replica is high-cardinality for metrics (lines 163–171)
original_cpln_replica is valuable for traces where high cardinality is fine. As a metric dimension it will cause a cardinality explosion (one time series per replica). The Safety Notes warn about high-cardinality labels, but don't explicitly call out replica as an example. Recommend adding a note that original_cpln_replica (and original_cpln_image/original_commit_hash) should be kept in trace/log attributes only and excluded from metric dimensions.

Minor

5. Inconsistent env var prefix in Collector Workload section (lines 195–199)
The Rails section correctly uses the standard OTEL_ prefix. The Collector Workload section introduces OPEN_TELEMETRY_COLLECTOR_RECEIVER_ENDPOINT, OPEN_TELEMETRY_CONFIG, and OPEN_TELEMETRY_CONFIG_HASH. Readers may wonder whether these are standard OTEL vars. A brief note clarifying these are custom/application-defined vars (not OTEL-standard) would reduce confusion.

6. Rollout step 1 wording (line 311)
"behind ENABLE_OPEN_TELEMETRY=false" implies explicitly setting the variable to false. Since the guard is if ENV["ENABLE_OPEN_TELEMETRY"] == "true", the safe default is simply not setting the var. Suggest: "with ENABLE_OPEN_TELEMETRY unset (disabled by default)".

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

This is a docs-only PR adding docs/grafana-opentelemetry.md (a 632-line Grafana/OpenTelemetry guide for Rails apps on Control Plane), plus nav-link updates to README.md and docs/tips.md. No runtime, CLI, or infra code is touched. TOC renumbering in docs/tips.md and all cross-doc links/anchors checked out fine — those aren't broken.

Left 5 inline comments, most-severe first:

  1. docs/grafana-opentelemetry.md:347 — Likely mischaracterizes OTTL error_mode: ignore as producing "no error logged"; that description matches the separate silent mode. Worth confirming against current opentelemetry-collector-contrib docs before merging, since it also underpins the "keep propagate as default" recommendation a few lines down.
  2. docs/grafana-opentelemetry.md:183 — The "Recommended app workload env" snippet uses <app-name> where the rest of this doc (line 255) and the sibling docs/telemetry/application-instrumentation.md:21 use {{APP_NAME}}, the real cpflow apply-template substitution token. Copy-pasting this snippet into a template would ship a literal, non-resolving hostname.
  3. docs/grafana-opentelemetry.md:353 — Framing implies error_mode: propagate causes something like a collector restart/crash; it actually drops/logs the affected data without restarting the process. Could lead someone to alert on the wrong signal.
  4. docs/grafana-opentelemetry.md:323-324 — Very specific version/PR citation for IsRootSpan() (v0.105.0, contrib #32918) that I couldn't verify — please double-check before merging.
  5. docs/grafana-opentelemetry.md:221 — "Collector Workload" section duplicates the canonical template already in docs/telemetry/collector.md (which also covers identity/policy setup this section omits) without linking to it — risk of drift between the two docs over time.

Nothing else surfaced with enough confidence to report — TOC/link integrity, port/pipeline wiring, and secret-reference syntax all check out.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

This PR is docs-only: it adds docs/grafana-opentelemetry.md (638 lines, a Rails/Grafana companion guide to the existing docs/telemetry/ docs) and links it from README.md and docs/tips.md. The doc is thorough and generally well cross-referenced against the canonical telemetry docs. I focused verification on technical accuracy of the config examples against the existing docs/telemetry/collector.md and docs/telemetry/application-instrumentation.md, since those are presented as the canonical templates this guide builds on.

Findings (posted inline, most severe first):

  1. docs/grafana-opentelemetry.md:258 (and :431) — uses bare ${VAR} env-var substitution syntax in the collector config example, but the canonical docs/telemetry/collector.md (pinning collector-contrib 0.155.0) consistently requires ${env:VAR}. Copied literally, the receiver endpoint won't be substituted and the OTLP receiver won't bind correctly.
  2. docs/grafana-opentelemetry.md:176 (and the collector's env block at ~254-262) — "Recommended env" is written as a flat KEY: "value" mapping, but the real Control Plane env schema (per application-instrumentation.md and collector.md) is a list of {name:, value:} objects. Pasted directly into a template, this is invalid schema.
  3. docs/grafana-opentelemetry.md:385status.code is listed under the spanmetrics connector's dimensions: (sourced from span attributes), but span status is a separate field, not an attribute — this dimension likely won't populate, which would silently break the error-rate dashboard panel the doc builds on top of it.
  4. docs/grafana-opentelemetry.md:478 — sampling is mentioned exactly once in the whole doc, as a prose aside after the full pipeline is already presented as deployable, and never appears in the "Minimum collector config components" list or the production readiness checklist. Following the doc's own checklists literally could ship a 100%-sampled pipeline to production.

Not inline, worth a look: per AGENTS.md's changelog policy ("CHANGELOG.md — Keep-a-Changelog; user-visible changes only"), and the precedent set by the predecessor telemetry-docs PR (#369, which did add a CHANGELOG entry for the same class of change), this PR adds a new README/tips-linked doc page but has no CHANGELOG.md entry.

Nothing else survived verification — links/anchors, TOC renumbering, Ruby initializer code, cpln://secret/... syntax, and OTTL statements all checked out correct.

Comment thread docs/grafana-opentelemetry.md Outdated
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

This is a well-written, technically solid docs-only addition (docs/grafana-opentelemetry.md, ~650 lines, plus TOC/link updates in README.md and docs/tips.md). A dedicated technical-accuracy pass over the Ruby/OpenTelemetry SDK code, YAML collector configs, and OTTL snippets found no factual errors — API usage, gem names, pipeline wiring (spanmetrics as traces-exporter/metrics-receiver), the error_mode/filter-ordering explanation, and the 4317-vs-4318 port distinction all check out. TOC renumbering in docs/tips.md is internally consistent and no anchors break.

Two minor polish items (posted as inline comments), both low-severity/PLAUSIBLE on verification — nothing blocking:

  1. docs/grafana-opentelemetry.md:244-246 — the "Recommended firewall" bullet list (external inbound: none) doesn't map onto the actual firewallConfig schema documented in docs/telemetry/collector.md (no inbound key exists under external). Likely just needs rewording to match the "external ingress" terminology used elsewhere in the same doc.
  2. docs/grafana-opentelemetry.md:266 — introduces TELEMETRY_COLLECTOR_RECEIVER_ENDPOINT with ${env:...} substitution for the OTLP receiver endpoint, while docs/telemetry/collector.md (this doc's stated "canonical template") hardcodes the same value literally. Both resolve to the same value, so nothing breaks, but worth a one-line note that the pattern is an intentional variant.

Note on CHANGELOG.md: this PR doesn't update it. AGENTS.md scopes the changelog to "user-visible changes," and a comparable prior doc addition (the telemetry docs, PR #369) did get an entry — but another comparable docs-only PR (#347) explicitly skipped it with the stated reasoning "the changelog is scoped to source code." Given that mixed precedent, this is a judgment call for the maintainer rather than a clear-cut convention violation — flagging for awareness, not requesting a change.

Also worth a quick maintainer read (not filed as findings since they're subjective/architectural, not bugs): a couple of independent review passes both flagged that this doc restates a fair amount of content already owned by docs/telemetry/collector.md and docs/telemetry/application-instrumentation.md (env var semantics, secret URI syntax, rollout ordering, port lists) rather than purely linking to it — worth a skim to decide if any of that should be trimmed in favor of the canonical docs, since two copies of the same guidance can drift apart over time.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md Outdated
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

Docs-only PR adding docs/grafana-opentelemetry.md (a Grafana/OpenTelemetry guide for Rails apps on Control Plane), plus navigation links from README.md and docs/tips.md. TOC renumbering, internal anchors, and all cross-file links (/docs/telemetry/*.md, /docs/secrets-and-env-values.md, cpflow apply-template/cpflow ps:restart) were checked and are correct.

Three issues worth a look (inline comments on each):

  1. docs/grafana-opentelemetry.md:431-466 — The "combined" collector config example omits the top-level processors: block entirely, and the batch processor (listed as a required component and referenced twice in service.pipelines) is never defined anywhere in the doc, unlike every other named component. Following this example literally produces a config the collector will refuse to start.
  2. docs/grafana-opentelemetry.md — No CHANGELOG.md entry, despite AGENTS.md's changelog rule and local precedent: the companion telemetry-docs PR ([codex] Document telemetry setup #369) has an [Unreleased] > Added entry for the same kind of change.
  3. docs/grafana-opentelemetry.md:242-246 — The "Recommended firewall" bullets restate docs/telemetry/collector.md's firewallConfig policy (already given there with a working YAML example) without adding anything Rails/Grafana-specific; worth linking instead of duplicating to avoid drift.

Everything else — TOC numbering, anchors, sibling-doc links, CLI command references — checked out.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

This is a well-written, docs-only PR (new docs/grafana-opentelemetry.md, plus nav updates in README.md/docs/tips.md and a CHANGELOG.md entry). Nav links, TOC numbering, and cross-references to existing docs/telemetry/* docs all check out. Two concrete issues found in the collector-config examples (left as inline comments):

  1. Missing connectors: block in the "complete" wiring example (lines 424-482) — the doc claims this snippet includes all top-level component stubs referenced by service.pipelines, but the spanmetrics/http_root_span_latency connector is referenced as both exporter and receiver without being defined in that block. A reader copying it verbatim gets a collector that fails to start.
  2. memory_limiter processor omitted from the recommended pipeline, despite docs/telemetry/collector.md treating it as load-bearing and this guide's own text warning that full-fidelity span export before filtering can overload the collector — a real OOM-risk gap.

Minor (not blocking): the transform/normalize processor block is duplicated verbatim (once standalone, once inside the full pipeline example) — a single source of truth or an explicit "same as above" note would avoid drift if it's edited later (e.g. when swapping IsRootSpan() for the older-collector fallback).

Everything else — Ruby initializer syntax, env var names, port/protocol consistency, OTTL syntax, secret reference syntax, rollout-order numbering — checked out correctly.

Comment thread docs/grafana-opentelemetry.md
Comment thread docs/grafana-opentelemetry.md
@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review summary

This is a docs-only PR adding docs/grafana-opentelemetry.md (a Rails/Grafana companion guide to the existing docs/telemetry/* docs), plus small nav/index updates in README.md, docs/tips.md, and a CHANGELOG.md entry. No application code, CI, or runtime config changes. Link targets, anchors, and the docs/tips.md TOC renumbering all check out correctly, and the security/efficiency guidance in the new doc (secret handling, db_statement: :obfuscate, cardinality warnings, error_mode: propagate) is solid.

Two things worth fixing before merge (left as inline comments):

  1. Broken consolidated collector config example (docs/grafana-opentelemetry.md ~L436-513) — the "complete" YAML example references memory_limiter in both pipelines (service.pipelines.traces.processors / service.pipelines.metrics.processors) but never defines it in that code block's processors: section (it's only defined in an earlier, separate snippet). A reader who copies this example as-is gets a collector that fails to start.
  2. Verbatim-duplicated YAML blocks (~L322-337, ~L355-361, ~L391-413 vs. ~L443-482) — the transform/normalize, filter/non_root_spans, and spanmetrics connector blocks are each shown standalone and then repeated byte-for-byte in the final consolidated example, creating drift risk if only one copy is edited later. Not blocking, just a maintenance note.

Everything else (renumbered TOC, changelog format, cross-file links, Ruby initializer snippet, security/secret handling) looked correct.

@claude

claude Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Review: docs/grafana-opentelemetry.md (docs-only PR)

Reviewed the diff (CHANGELOG.md, README.md, docs/grafana-opentelemetry.md, docs/tips.md) with a focus on correctness, security, and consistency with the existing docs/telemetry/* guide from PR #369.

Overall: This is a low-risk, docs-only change. The guide is well cross-linked (all /docs/... and #anchor links resolve to real files/headings), the YAML/config examples are internally consistent (processor ordering, port numbers, pipeline wiring all check out), and the Ruby/OTel SDK snippets use correct gem require paths and API calls. No security issues — the guide actively recommends closed external ingress, non-production rollout first, secret-scoped collector identities, and keeping high-cardinality/secret data out of dashboards.

A couple of very minor nits, non-blocking:

  1. CHANGELOG wording is slightly ahead of the content. The new entry (CHANGELOG.md line 18) says the guide covers "generated span and log metrics," but the guide only ships a concrete, complete collector config for span-derived metrics (the spanmetrics connector pipeline). Log-to-metrics conversion is mentioned narratively ("custom metrics generated from common ... log patterns", "log patterns converted into metrics") but has no worked example/pipeline snippet like the trace one does. Consider softening the CHANGELOG line to "span metrics" (matching what's actually demonstrated), or add a short log-metrics example if that's still intended.

  2. zpages port (55679) isn't wired into the referenced workload template. The guide's "Recommended ports" list (L266-270) adds 55679 for zpages/debug on top of the docs/telemetry/collector.md template, but that template's example ports: list (and firewall example) only shows 4318, 9127, and 9292 — a reader copying the referenced YAML as-is won't actually expose the zpages port. Since this guide explicitly says it layers on top of that template, it might be worth a one-line callout to add the port entry there, similar to how the doc already calls out layering guidance elsewhere.

Nothing else stood out as incorrect — the many rounds of prior review (OTLP port default, IsRootSpan() version floor, processor ordering, spanmetrics dimensions) all look correctly resolved in the current version.

@justin808 justin808 merged commit 7c96df9 into main Jul 1, 2026
8 checks passed
@justin808 justin808 deleted the docs/grafana-opentelemetry branch July 1, 2026 06:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant