fix(job-logs-collector): Fix routing to otel-router #2599
vector can't actually convert to otel log format
❌ 12 tests failed (all 12 are flagged as ❄️ flaky).
Code Review
The change from a Consul service address to 127.0.0.1 will break log routing in Nomad environments using bridged networking unless the otel-router is running as a sidecar in the same task group. Setting retry_attempts to 0 causes immediate log loss during transient network issues or service restarts, so increasing the number of retries is necessary to improve the reliability of log delivery.
- type = "opentelemetry"
+ type = "http"
  inputs = [ "remove_internal" ]
+ uri = "http://127.0.0.1:${otel_router_http_port}/logs"
The change from a Consul service address to 127.0.0.1 will break log routing in Nomad environments using bridged networking unless the otel-router is running as a sidecar in the same task group. If the router is a separate service, it should continue to use the Consul discovery address to ensure connectivity across the cluster regardless of the network configuration.
uri = "http://otel-router.service.consul:${otel_router_http_port}/logs"
- [sinks.otel_router_non_internal.protocol.request]
+ [sinks.otel_router_non_internal.request]
  retry_attempts = 0
There shouldn't be any network issues over localhost. We have bigger problems if there are and retries are not gonna help. This saves vector from having to buffer any logs.
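For comparison, a middle ground between the two positions in this thread might look like the fragment below. It is only a sketch: the PR itself keeps `retry_attempts = 0`, and every value other than the table name is an assumption chosen for illustration. The keys are standard Vector `request` options.

```toml
# Hypothetical alternative, NOT what the PR ships: a few quick retries so a
# router restart does not drop logs, without holding events for long.
[sinks.otel_router_non_internal.request]
retry_attempts = 3               # assumed; the PR uses 0
retry_initial_backoff_secs = 1   # assumed; wait before the first retry
retry_max_duration_secs = 5      # assumed; upper bound on the wait between retries
timeout_secs = 5                 # assumed; per-request timeout
```

With `retry_attempts = 0`, a failed request is dropped immediately, which is the behavior the author prefers for a loopback target.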
Straightforward IaC fix: switches the vector otel-router sink from the broken opentelemetry type to http+json on a local port, with mechanical plumbing of a new port variable through both AWS and GCP stacks. Whole sink stays gated behind an off-by-default feature flag.
Extended reasoning...
Overview
The PR fixes routing from the logs-collector Vector job to the otel-router. Vector's opentelemetry sink can't actually emit OTLP log format here, so the sink is rewritten as a plain http sink (JSON codec, newline-delimited framing) pointing at http://127.0.0.1:${otel_router_http_port}/logs on the local node. A new otel_router_http_port variable (default 4321) is plumbed through both provider stacks (AWS: variables.tf → main.tf → nomad/variables.tf → nomad/main.tf → module; same for GCP), and the two Makefiles expose ENABLE_OTEL_ROUTER_LOGS / OTEL_ROUTER_HTTP_PORT as TF env-var bindings.
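As a rough illustration of the sink described above, the fragment below shows how those pieces might fit together in vector.toml. The sink name, `inputs`, `uri`, and `retry_attempts` come from the diff in this PR; the lines marked "assumed" are not shown in the thread and are filled in with plausible values for Vector's `http` sink.

```toml
# Sketch only; lines marked "assumed" are not visible in the PR diff.
[sinks.otel_router_non_internal]
type = "http"
inputs = [ "remove_internal" ]
# ${otel_router_http_port} is substituted by the Terraform template (default 4321).
uri = "http://127.0.0.1:${otel_router_http_port}/logs"
method = "post"                  # assumed

[sinks.otel_router_non_internal.encoding]
codec = "json"

[sinks.otel_router_non_internal.framing]
method = "newline_delimited"

[sinks.otel_router_non_internal.healthcheck]
enabled = false                  # assumed

[sinks.otel_router_non_internal.request]
retry_attempts = 0
timeout_secs = 5                 # assumed
```

Each batch is then sent as newline-delimited JSON objects in a single POST body, which is what the otel-router endpoint on `/logs` is expected to accept.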
Security risks
None meaningful. The sink only sends to loopback (127.0.0.1), so customer logs aren't exposed off-host by this change. No authentication, cryptography, or permissions surface is affected. The whole otel_router_non_internal sink is gated behind enable_otel_router_logs, which defaults to false, so existing deployments are unaffected unless they opt in.
Level of scrutiny
Low. This is an IaC-only change with no application logic, gated behind an off-by-default flag, and the modifications are highly symmetric across AWS/GCP — the same five-file plumbing pattern repeated. The vector.toml change uses well-documented http sink fields (type, uri, method, encoding.codec, framing.method, buffer.*, healthcheck.enabled, request.retry_attempts, request.timeout_secs).
Other factors
The bug-hunting system found no issues. CODEOWNERS only has a wildcard rule (no path-specific owners that would imply security-sensitive ownership). The change is self-contained and easily revertible if the local port assumption turns out to be wrong in a given environment — operators can override via the new variable.