
feat(api): add Kubernetes service discovery for orchestrator and temp…#2602

Merged
tvi merged 2 commits into main from t/kk
May 10, 2026

Conversation

@tvi
Contributor

@tvi tvi commented May 9, 2026

…late-manager

Adds Kubernetes implementations of the Discovery interface (introduced in the previous commit) and wires them in via a new config switch.

  • cfg.ServiceDiscoveryProvider ('nomad' default, or 'kubernetes'), plus K8sNamespace and label-selector env vars for the orchestrator and template-manager pods
  • new orchestrator/discovery.KubernetesDiscovery and clusters/discovery.KubernetesServiceDiscovery; both list pods by label selector in the api's namespace, filter to Running+Ready, and use status.HostIP (orchestrator/template-manager pods run with host_network=true) as the address. Pod UID is the unique identifier; pod name truncated to NodeIDLength is the short-id key. Discovered nodes are addressed on consts.OrchestratorAPIPort.
  • handlers/store.go selects the implementation at startup; when SERVICE_DISCOVERY_PROVIDER=kubernetes, both discoveries share an in-cluster K8s client built from rest.InClusterConfig
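
The pod filtering described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration, not the code from the PR: the `pod` and `node` types are hypothetical stand-ins for the Kubernetes pod fields mentioned in the description, and the port value is a placeholder for consts.OrchestratorAPIPort.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// pod is a hypothetical, trimmed-down stand-in for the pod fields the
// description mentions. It is not the client-go Pod type.
type pod struct {
	UID    string // pod UID, used as the unique identifier
	Phase  string // e.g. "Running" or "Pending"
	Ready  bool   // true when the pod's Ready condition is true
	HostIP string // status.hostIP; usable because the pods run on the host network
}

// node is a hypothetical discovery result.
type node struct {
	ID      string
	Address string
}

// orchestratorAPIPort is an assumed placeholder for consts.OrchestratorAPIPort;
// the real value is not stated in this conversation.
const orchestratorAPIPort = 5008

// discover keeps only pods that are Running and Ready and have a host IP,
// and addresses each one on the orchestrator API port.
func discover(pods []pod) []node {
	var nodes []node
	for _, p := range pods {
		if p.Phase != "Running" || !p.Ready || p.HostIP == "" {
			continue
		}
		nodes = append(nodes, node{
			ID:      p.UID,
			Address: net.JoinHostPort(p.HostIP, strconv.Itoa(orchestratorAPIPort)),
		})
	}
	return nodes
}

func main() {
	nodes := discover([]pod{
		{UID: "uid-a", Phase: "Running", Ready: true, HostIP: "10.0.0.1"},
		{UID: "uid-b", Phase: "Pending", Ready: false, HostIP: "10.0.0.2"},
		{UID: "uid-c", Phase: "Running", Ready: false, HostIP: "10.0.0.3"},
	})
	for _, n := range nodes {
		fmt.Println(n.ID, n.Address)
	}
}
```

Only the first pod survives the filter in this example, since the other two are either not Running or not Ready.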

Behavior preserved: with the default SERVICE_DISCOVERY_PROVIDER=nomad the api still talks to the local Nomad agent exactly as before, so the existing Nomad deploy is unaffected.
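
As a rough illustration of the default-preserving switch, the sketch below resolves the provider from SERVICE_DISCOVERY_PROVIDER and falls back to 'nomad' when the variable is unset. The function name `resolveProvider` and the rejection of unknown values are assumptions for illustration; the actual config parsing in cfg may differ.

```go
package main

import (
	"fmt"
	"os"
)

const (
	providerNomad      = "nomad"
	providerKubernetes = "kubernetes"
)

// resolveProvider is a hypothetical helper: an empty value keeps the
// existing Nomad behavior, and unknown values are rejected at startup.
func resolveProvider(raw string) (string, error) {
	switch raw {
	case "", providerNomad:
		return providerNomad, nil
	case providerKubernetes:
		return providerKubernetes, nil
	default:
		return "", fmt.Errorf("unknown service discovery provider %q", raw)
	}
}

func main() {
	provider, err := resolveProvider(os.Getenv("SERVICE_DISCOVERY_PROVIDER"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using provider:", provider)
}
```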

@codecov

codecov Bot commented May 9, 2026

❌ 12 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2609 | 12 | 2597 | 7 |
View the full list of 14 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList

Flake rate in main: 49.68% (Passed 78 times, Failed 77 times)

Stack Traces | 0s run time
=== RUN   TestSnapshotTemplateList
=== PAUSE TestSnapshotTemplateList
=== CONT  TestSnapshotTemplateList
--- FAIL: TestSnapshotTemplateList (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID

Flake rate in main: 49.68% (Passed 78 times, Failed 77 times)

Stack Traces | 16.1s run time
=== RUN   TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== PAUSE TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
=== CONT  TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
    snapshot_template_test.go:146: 
        	Error Trace:	.../api/sandboxes/snapshot_template_test.go:37
        	            				.../api/sandboxes/snapshot_template_test.go:146
        	Error:      	Not equal: 
        	            	expected: 201
        	            	actual  : 500
        	Test:       	TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID
--- FAIL: TestSnapshotTemplateList/list_snapshots_filtered_by_sandbox_ID (16.14s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 73.35% (Passed 85 times, Failed 234 times)

Stack Traces | 178s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (177.80s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 73.79% (Passed 81 times, Failed 228 times)

Stack Traces | 2.65s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1366}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox inzhoefwriqcbxkomorvw
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1367}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1368}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Sat, 09 May 2026 23:16:18 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox inzhoefwriqcbxkomorvw
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (2.65s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 53.22% (Passed 138 times, Failed 157 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 58.60% (Passed 77 times, Failed 109 times)

Stack Traces | 8.12s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1259}}
Executing command python in sandbox il4boh667gvqkegxyxgbm
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (8.12s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir

Flake rate in main: 49.74% (Passed 95 times, Failed 94 times)

Stack Traces | 1.56s run time
=== RUN   TestListDir
=== PAUSE TestListDir
=== CONT  TestListDir
--- FAIL: TestListDir (1.56s)
Executing command python in sandbox izhta2muubnmnz8cr5qyr
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_0_lists_only_root_directory

Flake rate in main: 52.47% (Passed 77 times, Failed 85 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_0_lists_only_root_directory
=== PAUSE TestListDir/depth_0_lists_only_root_directory
=== CONT  TestListDir/depth_0_lists_only_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_0_lists_only_root_directory
--- FAIL: TestListDir/depth_0_lists_only_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_1_lists_root_directory

Flake rate in main: 52.47% (Passed 77 times, Failed 85 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_1_lists_root_directory
=== PAUSE TestListDir/depth_1_lists_root_directory
=== CONT  TestListDir/depth_1_lists_root_directory
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_1_lists_root_directory
--- FAIL: TestListDir/depth_1_lists_root_directory (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)

Flake rate in main: 52.47% (Passed 77 times, Failed 85 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== PAUSE TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
=== CONT  TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory)
--- FAIL: TestListDir/depth_2_lists_first_level_of_subdirectories_(in_this_case_the_root_directory) (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestListDir/depth_3_lists_all_directories_and_files

Flake rate in main: 52.47% (Passed 77 times, Failed 85 times)

Stack Traces | 0.01s run time
=== RUN   TestListDir/depth_3_lists_all_directories_and_files
=== PAUSE TestListDir/depth_3_lists_all_directories_and_files
=== CONT  TestListDir/depth_3_lists_all_directories_and_files
    filesystem_test.go:97: 
        	Error Trace:	.../tests/envd/filesystem_test.go:97
        	Error:      	Received unexpected error:
        	            	unavailable: 502 Bad Gateway
        	Test:       	TestListDir/depth_3_lists_all_directories_and_files
--- FAIL: TestListDir/depth_3_lists_all_directories_and_files (0.01s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 61.84% (Passed 87 times, Failed 141 times)

Stack Traces | 83.8s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (83.79s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 63.68% (Passed 77 times, Failed 135 times)

Stack Traces | 46.2s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1268}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 185 MB\nFree memory before tmpfs mount: 799 MB\nMemory to use in integrity test (80% of free, min 64MB): 639 MB\n"}}
Executing command bash in sandbox i6ygjz9mntiej8zuxznd8 (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"639+0 records in\n639+0 records out\n670040064 bytes (670 MB, 639 MiB) copied, 17.4008 s, 38.5 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=639\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 17.03\n\tPerc"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ent of CPU this job got: 97%\n\tElapsed (wall clock) time (h:mm:ss or m:ss)"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:": 0:17.46\n\tAverage "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tAverage total size (kbytes):"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" 0\n\tMaximum res"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ident set size"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" (kbytes): 2732"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\n\tAverage resi"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"dent set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 3\n\tMinor (reclaiming a frame) page faults: 345\n\tVolu"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"ntary context switches: 4\n\tInvol"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"untary context"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:" switches: 185\n\tSw"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"aps: 0\n\tFile "}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"system inputs: 176\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 832 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox i2gwr3yqax036ccnlwf9u
Executing command bash in sandbox i2gwr3yqax036ccnlwf9u (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1285}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"1f333493dfe3fcf638efdea87c974d40ee93fb2ec739947220419695e16318ef\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox i2gwr3yqax036ccnlwf9u
Executing command bash in sandbox i2gwr3yqax036ccnlwf9u (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1288}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox i2gwr3yqax036ccnlwf9u: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (46.22s)
github.com/e2b-dev/infra/tests/integration/internal/tests/proxies::TestSandboxAutoResumeViaProxy

Flake rate in main: 49.35% (Passed 78 times, Failed 76 times)

Stack Traces | 20.3s run time
=== RUN   TestSandboxAutoResumeViaProxy
=== PAUSE TestSandboxAutoResumeViaProxy
=== CONT  TestSandboxAutoResumeViaProxy
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"idgy4uywk04fxbiy0iviu","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"idgy4uywk04fxbiy0iviu","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:97: [Status code: 502] Response body: {"sandboxId":"idgy4uywk04fxbiy0iviu","message":"The sandbox is running but port is not open","port":8000,"code":502}
    auto_resume_test.go:116: 
        	Error Trace:	.../tests/proxies/auto_resume_test.go:116
        	Error:      	Received unexpected error:
        	            	Get "http://localhost:3002": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestSandboxAutoResumeViaProxy
--- FAIL: TestSandboxAutoResumeViaProxy (20.27s)


@cursor cursor Bot left a comment


Stale comment

Comment thread packages/api/internal/cfg/model.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f70a225f1e

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread packages/api/internal/orchestrator/discovery/kubernetes.go Outdated
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

Truncating Kubernetes pod names to a fixed length will cause ID collisions because pod names in a controller share common prefixes, leading to incorrect routing or state corruption. Using string formatting to join IP addresses and ports fails for IPv6 hosts; net.JoinHostPort should be used to ensure correct formatting. The K8sNamespace configuration field lacks a default value or requirement check, which may cause the client to query all namespaces and fail due to RBAC restrictions.
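
The IPv6 point in the review can be seen with a small stdlib-only comparison. The helper names `naiveAddr`, `safeAddr`, and `parses` are hypothetical and the addresses are made up; only `net.JoinHostPort` and `net.SplitHostPort` are real standard-library calls.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// naiveAddr joins host and port with plain string formatting.
func naiveAddr(host string, port int) string {
	return fmt.Sprintf("%s:%d", host, port)
}

// safeAddr uses net.JoinHostPort, which brackets IPv6 literals.
func safeAddr(host string, port int) string {
	return net.JoinHostPort(host, strconv.Itoa(port))
}

// parses reports whether addr can be split back into host and port.
func parses(addr string) bool {
	_, _, err := net.SplitHostPort(addr)
	return err == nil
}

func main() {
	for _, host := range []string{"10.0.0.1", "fd00::1"} {
		n, s := naiveAddr(host, 5008), safeAddr(host, 5008)
		fmt.Printf("naive=%s parses=%v | safe=%s parses=%v\n", n, parses(n), s, parses(s))
	}
}
```

For the IPv4 host both forms are identical, while for the IPv6 host only the bracketed form produced by `net.JoinHostPort` can be split back into host and port.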

Comment thread packages/api/internal/orchestrator/discovery/kubernetes.go Outdated
Comment thread packages/api/internal/orchestrator/discovery/kubernetes.go Outdated
Comment thread packages/api/internal/cfg/model.go Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8d7f7d6184


Comment thread packages/api/internal/orchestrator/discovery/kubernetes.go Outdated
@tvi tvi force-pushed the t/kk branch 2 times, most recently from 2cf4f2a to 3fcf006 Compare May 9, 2026 16:49

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: edb90a4eb4


Comment thread packages/api/internal/clusters/discovery/kubernetes.go

@cursor cursor Bot left a comment


Stale comment

Comment thread packages/api/internal/orchestrator/discovery/kubernetes.go
…late-manager

Adds Kubernetes implementations of the Discovery interface (introduced
in the previous commit) and wires them in via a new config switch.

  - cfg.ServiceDiscoveryProvider ('nomad' default, or 'kubernetes'),
    plus K8sNamespace and label-selector env vars for the orchestrator
    and template-manager pods
  - new orchestrator/discovery.KubernetesDiscovery and
    clusters/discovery.KubernetesServiceDiscovery; both list pods by
    label selector in the api's namespace, filter to Running+Ready,
    and use status.HostIP (orchestrator/template-manager pods run with
    host_network=true) as the address. Discovered nodes are addressed
    on consts.OrchestratorAPIPort.
  - the K8s backend uses the full pod name as the discovery ShortID
    rather than truncating to consts.NodeIDLength like the Nomad
    backend does. Nomad node IDs are UUIDs whose first 8 hex chars are
    effectively unique; pod names share a long DaemonSet/Deployment
    prefix and would collide under that truncation, collapsing every
    orchestrator pod into a single discovery key and silently dropping
    all but one. The ShortID is opaque to downstream consumers
    (string-compare key only), so relaxing the width invariant for
    K8s only is safe. The tradeoff is documented inline.
  - handlers/store.go selects the implementation at startup; when
    SERVICE_DISCOVERY_PROVIDER=kubernetes both discoveries share an
    in-cluster K8s client built from rest.InClusterConfig
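
The collision described in the ShortID bullet can be reproduced with a short sketch. It assumes a truncation width of 8, taken from the "first 8 hex chars" remark; the pod names and UUIDs are invented for illustration, and `truncate` and `distinctKeys` are hypothetical helpers.

```go
package main

import "fmt"

// nodeIDLength is assumed to be 8, matching the "first 8 hex chars" remark;
// the real consts.NodeIDLength value is not shown in this conversation.
const nodeIDLength = 8

// truncate mimics the Nomad-style short ID: the first nodeIDLength characters.
func truncate(id string) string {
	if len(id) <= nodeIDLength {
		return id
	}
	return id[:nodeIDLength]
}

// full returns the ID unchanged, as the K8s backend does with pod names.
func full(id string) string { return id }

// distinctKeys counts how many different keys a set of IDs maps to.
func distinctKeys(ids []string, key func(string) string) int {
	seen := make(map[string]struct{})
	for _, id := range ids {
		seen[key(id)] = struct{}{}
	}
	return len(seen)
}

func main() {
	// Invented examples: pod names share a controller prefix, UUIDs do not.
	pods := []string{"orchestrator-7xk2p", "orchestrator-b9qzt", "orchestrator-m4w8c"}
	uuids := []string{
		"3f1c9a7e-0000-0000-0000-000000000001",
		"a82d44b0-0000-0000-0000-000000000002",
		"c5e07f19-0000-0000-0000-000000000003",
	}
	fmt.Println("pods, truncated: ", distinctKeys(pods, truncate))  // 1
	fmt.Println("pods, full name: ", distinctKeys(pods, full))      // 3
	fmt.Println("uuids, truncated:", distinctKeys(uuids, truncate)) // 3
}
```

With truncation, all three pod names collapse to the key "orchestr", which is the failure mode the bullet describes; the full pod name keeps them distinct.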

Behavior preserved: with the default SERVICE_DISCOVERY_PROVIDER=nomad
the api still talks to the local Nomad agent exactly as before, so the
existing Nomad deploy is unaffected.
@tvi tvi merged commit e730224 into main May 10, 2026
49 checks passed
@tvi tvi deleted the t/kk branch May 10, 2026 04:02
3 participants