Feat/traffic keepalive#2591

Draft
matthewlouisbrockman wants to merge 119 commits into main from feat/traffic-keepalive

Conversation

@matthewlouisbrockman
Contributor

No description provided.

# Conflicts:
#	packages/client-proxy/internal/proxy/paused_sandbox_resumer_grpc.go
Add a shared route IP resolver with the local-cluster fallback needed by CI, and make API/client-proxy callers treat empty resolved routes as unavailable instead of successful resume responses.

This keeps BYOC/remote empty node IPs from being treated as routable while preserving the local 127.0.0.1 path.
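The resolver behavior described above can be sketched as follows. This is a hedged illustration only: `resolveRouteIP` and its signature are hypothetical, not the actual helper in `packages/client-proxy`; the real implementation lives behind the shared route IP resolver this PR adds.

```go
package main

import "fmt"

// resolveRouteIP is a hypothetical sketch of the shared resolver described
// above: empty node IPs (BYOC/remote) are treated as unavailable, while the
// local-cluster path falls back to 127.0.0.1 as needed by CI.
func resolveRouteIP(nodeIP string, localCluster bool) (string, bool) {
	if nodeIP != "" {
		return nodeIP, true
	}
	if localCluster {
		// CI / local-cluster fallback: route to loopback.
		return "127.0.0.1", true
	}
	// Empty IP on a remote node: not routable. Callers should treat the
	// sandbox as unavailable instead of returning a successful resume.
	return "", false
}

func main() {
	ip, ok := resolveRouteIP("", true)
	fmt.Println(ip, ok)
	ip, ok = resolveRouteIP("", false)
	fmt.Println(ip, ok)
}
```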
@cla-bot cla-bot Bot added the cla-signed label May 7, 2026
@cursor

cursor Bot commented May 7, 2026

PR Summary

Medium Risk
Touches sandbox lifecycle, routing catalog metadata, and gRPC auth/traffic validation; misconfigurations or missing catalog metadata (e.g., team_id) can prevent keepalive/auto-resume or cause unexpected permission failures.

Overview
This change introduces lifecycle (with traffic keepalive) as the new home for sandbox timeout behavior and wires it through create/resume/get, pause snapshots, and orchestrator create requests, while keeping legacy autoPause/autoResume decoding for compatibility.

Potential issues:

- Keepalive depends on routing-catalog team_id + metadata propagation; older or missing entries silently skip refresh.
- Keepalive refresh runs async and is only throttled via catalog acquire (no release).
- The new KeepAliveSandbox and refactored resume auth+token checks add more permission-failure paths that may surface as unexpected NotFound/PermissionDenied during proxy traffic.

Reviewed by Cursor Bugbot for commit 088532b. Bugbot is set up for automated code reviews on this repo. Configure here.

@codecov

codecov Bot commented May 7, 2026

❌ 11 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
| --- | --- | --- | --- |
| 2624 | 11 | 2613 | 7 |
View the top 2 failed test(s) by shortest run time
github.com/e2b-dev/infra/packages/shared/pkg/proxy::TestProxyRoutesToTargetServer
Stack Traces | 0.01s run time
=== RUN   TestProxyRoutesToTargetServer
=== PAUSE TestProxyRoutesToTargetServer
=== CONT  TestProxyRoutesToTargetServer
    proxy_test.go:215: 
        	Error Trace:	.../pkg/proxy/proxy_test.go:215
        	Error:      	Not equal: 
        	            	expected: 0x1
        	            	actual  : 0x2
        	Test:       	TestProxyRoutesToTargetServer
        	Messages:   	backend should have been called once
--- FAIL: TestProxyRoutesToTargetServer (0.01s)
github.com/e2b-dev/infra/packages/shared/pkg/proxy::TestProxyRetriesOnDelayedBackendStartup
Stack Traces | 0.3s run time
=== RUN   TestProxyRetriesOnDelayedBackendStartup
=== PAUSE TestProxyRetriesOnDelayedBackendStartup
=== CONT  TestProxyRetriesOnDelayedBackendStartup
    proxy_test.go:864: 
        	Error Trace:	.../pkg/proxy/proxy_test.go:130
        	            				.../pkg/proxy/proxy_test.go:864
        	Error:      	Not equal: 
        	            	expected: "backend-1"
        	            	actual  : "delayed-backend"
        	            	
        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1 +1 @@
        	            	-backend-1
        	            	+delayed-backend
        	Test:       	TestProxyRetriesOnDelayedBackendStartup
        	Messages:   	backend id should be the same
    proxy_test.go:867: 
        	Error Trace:	.../pkg/proxy/proxy_test.go:867
        	Error:      	"9.261174ms" is not greater than or equal to "300ms"
        	Test:       	TestProxyRetriesOnDelayedBackendStartup
        	Messages:   	request should have waited for backend to start
    proxy_test.go:871: 
        	Error Trace:	.../pkg/proxy/proxy_test.go:871
        	Error:      	Not equal: 
        	            	expected: 0x1
        	            	actual  : 0x0
        	Test:       	TestProxyRetriesOnDelayedBackendStartup
        	Messages:   	backend should have been called once
--- FAIL: TestProxyRetriesOnDelayedBackendStartup (0.30s)
View the full list of 13 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/metrics::TestTeamMetrics

Flake rate in main: 67.05% (Passed 57 times, Failed 116 times)

Stack Traces | 1.37s run time
=== RUN   TestTeamMetrics
=== PAUSE TestTeamMetrics
=== CONT  TestTeamMetrics
    team_metrics_test.go:61: 
        	Error Trace:	.../api/metrics/team_metrics_test.go:61
        	Error:      	Should be true
        	Test:       	TestTeamMetrics
        	Messages:   	MaxConcurrentSandboxes should be >= 0
--- FAIL: TestTeamMetrics (1.37s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox

Flake rate in main: 49.09% (Passed 56 times, Failed 54 times)

Stack Traces | 0s run time
=== RUN   TestSnapshotTemplateCreateSandbox
=== PAUSE TestSnapshotTemplateCreateSandbox
=== CONT  TestSnapshotTemplateCreateSandbox
--- FAIL: TestSnapshotTemplateCreateSandbox (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create

Flake rate in main: 49.09% (Passed 56 times, Failed 54 times)

Stack Traces | 35.6s run time
=== RUN   TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== PAUSE TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
=== CONT  TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
    snapshot_template_test.go:294: Command [/bin/sh] output: event:{start:{pid:1271}}
    snapshot_template_test.go:294: Command [/bin/sh] output: event:{end:{exited:true status:"exit status 0"}}
    snapshot_template_test.go:294: Command [/bin/sh] completed successfully in sandbox iyynk1l4xnmejj2bzpx0n
    snapshot_template_test.go:299: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:299
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 409
        	Test:       	TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create
--- FAIL: TestSnapshotTemplateCreateSandbox/overwritten_snapshot_build_is_served_immediately_on_sandbox_create (35.57s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateDelete

Flake rate in main: 49.09% (Passed 56 times, Failed 54 times)

Stack Traces | 0s run time
=== RUN   TestSnapshotTemplateDelete
=== PAUSE TestSnapshotTemplateDelete
=== CONT  TestSnapshotTemplateDelete
--- FAIL: TestSnapshotTemplateDelete (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateDelete/delete_named_snapshot_by_name

Flake rate in main: 49.09% (Passed 56 times, Failed 54 times)

Stack Traces | 15.6s run time
=== RUN   TestSnapshotTemplateDelete/delete_named_snapshot_by_name
=== PAUSE TestSnapshotTemplateDelete/delete_named_snapshot_by_name
=== CONT  TestSnapshotTemplateDelete/delete_named_snapshot_by_name
    snapshot_template_test.go:189: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:189
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateDelete/delete_named_snapshot_by_name
--- FAIL: TestSnapshotTemplateDelete/delete_named_snapshot_by_name (15.60s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 73.36% (Passed 61 times, Failed 168 times)

Stack Traces | 38.3s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (38.28s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.54% (Passed 59 times, Failed 164 times)

Stack Traces | 3.06s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1346}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
Executing command curl in sandbox ivzmnw2itycmy44xeipmg
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1347}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35 exited:true status:"exit status 35" error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1348}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Fri, 08 May 2026 17:58:53 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox ivzmnw2itycmy44xeipmg
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.06s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 53.40% (Passed 96 times, Failed 110 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.96% (Passed 55 times, Failed 79 times)

Stack Traces | 7.88s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
Executing command python in sandbox ihmfq8jbq7j50srahzrds
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.88s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 60.43% (Passed 55 times, Failed 84 times)

Stack Traces | 8.87s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
Executing command python in sandbox izy9odku7eiujt74s74jt
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (8.87s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 60.43% (Passed 55 times, Failed 84 times)

Stack Traces | 8.99s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox idvlqxocyqcr681lbwxmd
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1266}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
Executing command python in sandbox i3qhnr0p4yhqlmlczcw42
--- FAIL: TestBindLocalhost/bind_localhost (8.99s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 61.31% (Passed 65 times, Failed 103 times)

Stack Traces | 75.8s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (75.81s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.82% (Passed 55 times, Failed 97 times)

Stack Traces | 31.6s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1259}}
Executing command bash in sandbox ipnj5ktn391pg6c4pkgqn (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 184 MB\nFree memory before tmpfs mount: 800 MB\nMemory to use in integrity test (80% of free, min 64MB): 640 MB\n"}}
Executing command bash in sandbox ipnj5ktn391pg6c4pkgqn (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"640+0 records in\n640+0 records out\n671088640 bytes (671 MB, 640 MiB) copied, 23.9555 s, 28.0 MB/s"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=640\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 23.52\n\tPercent of CPU this job got: 97%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:24.15\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2644\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 346\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 169\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i3kj47ytqfopjdsd3u1jf
Executing command bash in sandbox i3kj47ytqfopjdsd3u1jf (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1275}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"15038def14d31d4d6c080bc7f728c3097fa9b32d02ff70e2ff830a1b8f8cb3d5\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i3kj47ytqfopjdsd3u1jf
    sandbox_memory_integrity_test.go:84: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:84
        	Error:      	Not equal: 
        	            	expected: 204
        	            	actual  : 404
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (31.64s)



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

The metadataFirstValue function returns a boolean indicating if the value was found. Ignoring this boolean for providedToken and providedEnvdToken means that a missing header is treated the same as an empty header. It is more robust and explicit to check if the token was actually provided before attempting to match it, especially for security-sensitive authentication tokens.
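The suggested fix can be sketched like this. The helper and map-based metadata below are simplified stand-ins (the real code uses gRPC metadata); the point is checking the ok flag so a missing header fails differently from a wrong one.

```go
package main

import "fmt"

// metadataFirstValue is a simplified stand-in for the helper the review
// refers to: first value for a key, plus an ok flag indicating presence.
func metadataFirstValue(md map[string][]string, key string) (string, bool) {
	vals, ok := md[key]
	if !ok || len(vals) == 0 {
		return "", false
	}
	return vals[0], true
}

// checkToken distinguishes a missing header from an invalid one, as the
// review suggests, instead of comparing "" against the expected token.
func checkToken(md map[string][]string, key, expected string) error {
	provided, ok := metadataFirstValue(md, key)
	if !ok {
		return fmt.Errorf("missing %s header", key)
	}
	if provided != expected {
		return fmt.Errorf("invalid %s token", key)
	}
	return nil
}

func main() {
	md := map[string][]string{"x-access-token": {"secret"}}
	fmt.Println(checkToken(md, "x-access-token", "secret"))
	fmt.Println(checkToken(map[string][]string{}, "x-access-token", "secret"))
}
```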

Comment thread packages/api/internal/handlers/proxy_grpc.go Outdated
Comment thread packages/client-proxy/internal/proxy/proxy.go
Comment thread packages/api/internal/orchestrator/catalog/keepalive.go Outdated
```go
	}

	timeout = uint64(*lifecycle.Keepalive.Traffic.Timeout)
}
```

Keepalive timeout stored without upper-bound validation

Low Severity

buildKeepaliveConfig validates that the traffic keepalive timeout is greater than the throttle interval but never enforces an upper bound. A caller could supply an arbitrarily large timeout (up to int32 max, roughly 68 years), which is stored and later used verbatim in KeepAliveFor. While getMaxAllowedTTL caps the actual extension at MaxInstanceLength, the uncapped stored value is misleading. On the low end, the TrafficKeepaliveThrottleInterval error message ("must be greater than N seconds") could fire for a zero or negative timeout that should have been caught by the earlier `< 0` check; in particular, no check rejects `Timeout = 0`, so a zero timeout can silently pass validation while producing a no-op keepalive.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b24dd2d. Configure here.


@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 088532b. Configure here.

```go
}

if updatedEndTime.IsZero() || !sbx.EndTime.Equal(updatedEndTime) {
	return &sbx, nil
```

KeepAliveFor skips update when store returns stale data

Medium Severity

The condition !sbx.EndTime.Equal(updatedEndTime) is intended to detect when the store returned a sandbox state that doesn't reflect the update. However, if the store implementation applies the update but returns a slightly different time (e.g., due to serialization rounding in Redis), or if a concurrent update modifies EndTime between the closure setting updatedEndTime and the returned sbx, this condition would incorrectly skip the UpdateSandbox and addSandboxToRoutingTable calls. The sandbox timeout would be updated in the local store but NOT propagated to the orchestrator node, causing the sandbox to expire prematurely on the node side while the API thinks it's still alive.


Reviewed by Cursor Bugbot for commit 088532b. Configure here.

```go
if sandbox.Lifecycle.Keepalive.Traffic != nil && sandbox.Lifecycle.Keepalive.Traffic.Enabled {
	keepalive.Traffic = &sandboxroutingcatalog.TrafficKeepalive{Enabled: true}
}
}
```

Keepalive config stored even when traffic disabled

Low Severity

When Lifecycle.Keepalive is non-nil but Traffic is nil or not enabled, a Keepalive{} struct with nil Traffic is still allocated and stored in the routing catalog. This creates unnecessary routing catalog entries with empty keepalive metadata for sandboxes that explicitly set enabled: false. The outer nil check on Keepalive alone is insufficient; it also needs to verify there is actually something meaningful to store. The same pattern is duplicated in metadata.go.


Reviewed by Cursor Bugbot for commit 088532b. Configure here.
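The fix shape the review suggests can be sketched like this. The types are simplified stand-ins for the lifecycle and routing-catalog structs; names are illustrative, not the repo's actual definitions.

```go
package main

import "fmt"

type TrafficKeepalive struct{ Enabled bool }
type Keepalive struct{ Traffic *TrafficKeepalive }
type Lifecycle struct{ Keepalive *Keepalive }

// buildCatalogKeepalive returns nil unless there is something meaningful to
// store, so sandboxes with keepalive disabled produce no catalog entry.
func buildCatalogKeepalive(lc *Lifecycle) *Keepalive {
	if lc == nil || lc.Keepalive == nil {
		return nil
	}
	t := lc.Keepalive.Traffic
	if t == nil || !t.Enabled {
		return nil // nothing to store: avoid empty keepalive metadata
	}
	return &Keepalive{Traffic: &TrafficKeepalive{Enabled: true}}
}

func main() {
	fmt.Println(buildCatalogKeepalive(&Lifecycle{Keepalive: &Keepalive{}}))
	enabled := &Lifecycle{Keepalive: &Keepalive{Traffic: &TrafficKeepalive{Enabled: true}}}
	fmt.Println(buildCatalogKeepalive(enabled) != nil)
}
```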
