fix(job-logs-collector): Fix routing to otel-router #2599

Open
wj-e2b wants to merge 1 commit into main from wj-vector-routing
wj-e2b (Contributor) commented May 8, 2026

Vector can't actually convert to the OTel log format.


codecov Bot commented May 8, 2026

❌ 12 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2594 | 12 | 2582 | 7 |
View the full list of 12 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 67.05% (Passed 57 times, Failed 116 times)

Stack Traces | 2.5s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (2.50s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 73.36% (Passed 61 times, Failed 168 times)

Stack Traces | 38s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
Executing command curl in sandbox ike3cqmagqd926a359uxo
--- FAIL: TestUpdateNetworkConfig (38.03s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.54% (Passed 59 times, Failed 164 times)

Stack Traces | 3.57s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox ionetsuqrg5enml7p8zpo
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1358}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ionetsuqrg5enml7p8zpo
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1359}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox il7yzzq0u6cz3132cdt3q
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 08 May 2026 18:25:53 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ionetsuqrg5enml7p8zpo
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.57s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV

Flake rate in main: 52.59% (Passed 55 times, Failed 61 times)

Stack Traces | 0s run time
=== RUN   TestTemplateBuildENV
=== PAUSE TestTemplateBuildENV
=== CONT  TestTemplateBuildENV
--- FAIL: TestTemplateBuildENV (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/templates::TestTemplateBuildENV/ENV_with_multiline_value

Flake rate in main: 54.72% (Passed 48 times, Failed 58 times)

Stack Traces | 24.6s run time
=== RUN   TestTemplateBuildENV/ENV_with_multiline_value
=== PAUSE TestTemplateBuildENV/ENV_with_multiline_value
=== CONT  TestTemplateBuildENV/ENV_with_multiline_value
    build_template_test.go:134: test-ubuntu-env-multiline: [info] Building template diogxt3s7l0tuezpzdvu/643a6f06-be15-4f47-98b2-befa3299517f
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] FROM ubuntu:22.04 [ffd709f131f42dfab282de47a91dd2c139e900c1c11fc574b49b517a05ef0a32]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] CACHED [base] DEFAULT USER user [90bdd4afa342293c931373351bf578872dec9179214ba3e8bf9edba311466213]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 1/2] ENV MULTILINE line1
        line2
        line3 [e93da3f3765f20eb6407c336b9e4e0b9321d994ec5f6cb547743a2a4070eed23]
    build_template_test.go:134: test-ubuntu-env-multiline: [info] [builder 2/2] RUN [[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1 [477610d61cdf858776262d3331809539bcbcf16f706aac18515a57337bae1786]
    build_template_test.go:134: test-ubuntu-env-multiline: [error] Build failed: failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1
    build_template_test.go:374: Build failed: {<nil> failed to run command '[[ $(echo "$MULTILINE" | wc -l) -eq 3 ]] || exit 1': exit status 1 0xc00070e690}
--- FAIL: TestTemplateBuildENV/ENV_with_multiline_value (24.59s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 53.40% (Passed 96 times, Failed 110 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.96% (Passed 55 times, Failed 79 times)

Stack Traces | 8.06s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
Executing command python in sandbox iz2f0d5gks1b3fqby9wit
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (8.06s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 53.72% (Passed 56 times, Failed 65 times)

Stack Traces | 6.87s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
Executing command python in sandbox ircci3ax6kg9l2ed5ix2i
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (6.87s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 60.43% (Passed 55 times, Failed 84 times)

Stack Traces | 7.38s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox iko37v5vuu9qhg2sizfbj
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.38s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.43% (Passed 55 times, Failed 84 times)

Stack Traces | 7.88s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox ixk0d08epovufydg5f7eo
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.88s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 61.31% (Passed 65 times, Failed 103 times)

Stack Traces | 72.2s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (72.21s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.82% (Passed 55 times, Failed 97 times)

Stack Traces | 36.2s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1251}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 184 MB\nFree memory before tmpfs mount: 800 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 3.57653 s, 188 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.55\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.58\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2632\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 343\n\tVoluntary context switches: 4\n\tInvoluntary context switches: 16\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox iq8j0xrxzv6fvw7v2c99n
Executing command bash in sandbox iq8j0xrxzv6fvw7v2c99n (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1268}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"1234e3acdbe48774f5c11822a0863d18b1ee992f991b60db9795dd48e4785e7f\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox iq8j0xrxzv6fvw7v2c99n
Executing command bash in sandbox iq8j0xrxzv6fvw7v2c99n (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1271}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox iq8j0xrxzv6fvw7v2c99n: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (36.20s)

gemini-code-assist Bot left a comment

Code Review

The change from a Consul service address to 127.0.0.1 will break log routing in Nomad environments using bridged networking unless the otel-router is running as a sidecar in the same task group. Setting retry_attempts to 0 causes immediate log loss during transient network issues or service restarts, so increasing the number of retries is necessary to improve the reliability of log delivery.

- type = "opentelemetry"
+ type = "http"
  inputs = [ "remove_internal" ]
+ uri = "http://127.0.0.1:${otel_router_http_port}/logs"
high

The change from a Consul service address to 127.0.0.1 will break log routing in Nomad environments using bridged networking unless the otel-router is running as a sidecar in the same task group. If the router is a separate service, it should continue to use the Consul discovery address to ensure connectivity across the cluster regardless of the network configuration.

uri = "http://otel-router.service.consul:${otel_router_http_port}/logs"


- [sinks.otel_router_non_internal.protocol.request]
+ [sinks.otel_router_non_internal.request]
  retry_attempts = 0
high

Setting retry_attempts to 0 causes immediate log loss during transient network issues or service restarts. Increasing the number of retries would improve the reliability of the log delivery to the otel-router.

retry_attempts = 3

wj-e2b (Contributor, Author), May 8, 2026

There shouldn't be any network issues over localhost. We have bigger problems if there are and retries are not gonna help. This saves vector from having to buffer any logs.
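
For readers following the retry discussion, here is a rough sketch of the request and buffer settings involved. Only the sink name and `retry_attempts = 0` come from the diff snippet in this thread; the timeout and buffer values are illustrative assumptions, not values confirmed by the PR.

```toml
# Sketch only: everything marked "assumed" is not taken from the PR.
[sinks.otel_router_non_internal.request]
retry_attempts = 0          # from the diff: a failed request is not retried
timeout_secs = 5            # assumed value

[sinks.otel_router_non_internal.buffer]
type = "memory"             # assumed: in-memory buffer
max_events = 500            # assumed value
when_full = "drop_newest"   # assumed: shed new events instead of applying backpressure
```

Under these assumptions, a request that fails is given up on immediately rather than held for another attempt, which is the trade-off the author accepts for a loopback destination. The reviewer's suggestion would change only `retry_attempts` to 3.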


claude Bot left a comment

Straightforward IaC fix: switches the vector otel-router sink from the broken opentelemetry type to http+json on a local port, with mechanical plumbing of a new port variable through both AWS and GCP stacks. Whole sink stays gated behind an off-by-default feature flag.

Extended reasoning...

Overview

The PR fixes routing from the logs-collector Vector job to the otel-router. Vector's opentelemetry sink can't actually emit OTLP log format here, so the sink is rewritten as a plain http sink (JSON codec, newline-delimited framing) pointing at http://127.0.0.1:${otel_router_http_port}/logs on the local node. A new otel_router_http_port variable (default 4321) is plumbed through both provider stacks (AWS: variables.tf → main.tf → nomad/variables.tf → nomad/main.tf → module; same for GCP), and the two Makefiles expose ENABLE_OTEL_ROUTER_LOGS / OTEL_ROUTER_HTTP_PORT as TF env-var bindings.
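
As a rough illustration of the sink shape described in this overview, a minimal sketch of such a block in vector.toml might look like the following. The sink name, input, and URI are taken from the review snippets in this thread; the `method`, `encoding`, `framing`, and `healthcheck` values are assumptions inferred from the field list in this review, not the PR's actual diff.

```toml
# Hypothetical sketch, not the PR's actual diff.
[sinks.otel_router_non_internal]
type = "http"
inputs = [ "remove_internal" ]
# ${otel_router_http_port} is presumably substituted when the job file is rendered
# (the overview gives 4321 as the variable's default).
uri = "http://127.0.0.1:${otel_router_http_port}/logs"
method = "post"                 # assumed value

[sinks.otel_router_non_internal.encoding]
codec = "json"                  # one JSON object per event

[sinks.otel_router_non_internal.framing]
method = "newline_delimited"    # events separated by newlines in the request body

[sinks.otel_router_non_internal.healthcheck]
enabled = false                 # assumed value
```

With this shape, each request body is a batch of JSON-encoded log events, one per line, posted to the router's /logs path on the same host.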

Security risks

None meaningful. The sink only sends to loopback (127.0.0.1), so customer logs aren't exposed off-host by this change. There's no auth/crypto/permissions surface affected. The whole otel_router_non_internal sink is gated behind enable_otel_router_logs, which defaults to false, so existing deployments are unaffected unless they opt in.

Level of scrutiny

Low. This is an IaC-only change with no application logic, gated behind an off-by-default flag, and the modifications are highly symmetric across AWS/GCP — the same five-file plumbing pattern repeated. The vector.toml change uses well-documented http sink fields (type, uri, method, encoding.codec, framing.method, buffer.*, healthcheck.enabled, request.retry_attempts, request.timeout_secs).

Other factors

The bug-hunting system found no issues. CODEOWNERS only has a wildcard rule (no path-specific owners that would imply security-sensitive ownership). The change is self-contained and easily revertible if the local port assumption turns out to be wrong in a given environment — operators can override via the new variable.
