feat(api): singleflight per sandbox, 256-concurrency cap, in-flight gauge #2625

Merged
jakubno merged 3 commits into main from evictor-singleflight-limit on May 12, 2026

Conversation

@jakubno
Member

@jakubno jakubno commented May 11, 2026

Dedupe overlapping eviction attempts for the same sandbox via singleflight, cap concurrent evictions at 256 using errgroup.TryGo so the ticker loop never blocks, and expose an observable gauge (api.evictor.evictions.running) for the in-flight count.
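The non-blocking cap described above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the `limiter` type and its `tryGo` method are hypothetical stand-ins for an `errgroup.Group` with a limit and `TryGo`, implemented here with a buffered channel so the example has no external dependencies. A limit of 2 replaces 256 to keep the output short.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter caps concurrent work without ever blocking the caller.
// Hypothetical stdlib stand-in for errgroup.Group with SetLimit + TryGo.
type limiter struct {
	slots chan struct{} // one buffered element per running task
	wg    sync.WaitGroup
}

func newLimiter(n int) *limiter {
	return &limiter{slots: make(chan struct{}, n)}
}

// tryGo starts fn in a goroutine if a slot is free and reports whether it did.
// When the cap is reached it returns false immediately instead of waiting.
func (l *limiter) tryGo(fn func()) bool {
	select {
	case l.slots <- struct{}{}:
	default:
		return false // cap reached; the caller can retry on its next tick
	}
	l.wg.Add(1)
	go func() {
		defer l.wg.Done()
		defer func() { <-l.slots }() // free the slot before signalling Done
		fn()
	}()
	return true
}

// wait blocks until every started task has finished.
func (l *limiter) wait() { l.wg.Wait() }

func main() {
	l := newLimiter(2)
	release := make(chan struct{})

	started := 0
	for i := 0; i < 5; i++ {
		// Each task blocks until release is closed, so slots stay occupied.
		if l.tryGo(func() { <-release }) {
			started++
		}
	}
	fmt.Println("started:", started, "skipped:", 5-started) // started: 2 skipped: 3

	close(release)
	l.wait()
}
```

Because `tryGo` never waits, a loop that calls it on every tick keeps its cadence even when all slots are busy; the skipped work is simply picked up on a later tick.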

…flight gauge

Dedupe overlapping eviction attempts for the same sandbox via
singleflight, cap concurrent evictions at 256 using errgroup.TryGo so
the ticker loop never blocks, and expose an observable gauge
(api.evictor.evictions.running) for the in-flight count.
@cla-bot cla-bot Bot added the cla-signed label May 11, 2026
@cursor

cursor Bot commented May 11, 2026

PR Summary

Medium Risk
Touches sandbox eviction scheduling and concurrency control; incorrect limits or dedupe keying could delay expirations or skip evictions under load.

Overview
Eviction processing now deduplicates concurrent attempts per SandboxID, limits work to 256 in-flight evictions via errgroup.TryGo (skipping excess until the next tick), and publishes a new observable gauge api.evictor.evictions.running based on the in-memory active-eviction map. evictor.New now requires a metric.Meter and can fail during metric registration, so orchestrator startup handles and surfaces that error.
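As a rough illustration of a gauge value "based on the in-memory active-eviction map", the sketch below counts the entries of a `sync.Map` on demand. The `activeSet` type and its method names are hypothetical; the PR's actual registration with `metric.Meter` is not shown here, and the real code may count differently. The idea is only that an observable-gauge callback could call something like `running()` at collection time.

```go
package main

import (
	"fmt"
	"sync"
)

// activeSet tracks sandbox IDs that currently have an eviction in progress.
// Hypothetical helper for illustration only.
type activeSet struct {
	m sync.Map // sandbox ID -> struct{}
}

// add marks id as active and reports whether it was newly added.
func (a *activeSet) add(id string) bool {
	_, loaded := a.m.LoadOrStore(id, struct{}{})
	return !loaded
}

// remove clears the marker once the eviction has finished.
func (a *activeSet) remove(id string) { a.m.Delete(id) }

// running counts active entries by walking the map. sync.Map has no Len
// method, so the count is computed when it is asked for.
func (a *activeSet) running() int64 {
	var n int64
	a.m.Range(func(_, _ any) bool {
		n++
		return true // keep iterating
	})
	return n
}

func main() {
	a := &activeSet{}
	a.add("sbx-a")
	a.add("sbx-b")
	a.add("sbx-a") // duplicate, not counted twice
	fmt.Println(a.running()) // 2

	a.remove("sbx-a")
	fmt.Println(a.running()) // 1
}
```

Counting on read keeps the hot path free of extra bookkeeping, at the cost of an O(n) walk per collection; with at most 256 in-flight evictions that walk stays small.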

Reviewed by Cursor Bugbot for commit f56fd8b.

@codecov

codecov Bot commented May 11, 2026

❌ 6 Tests Failed:

| Tests completed | Failed | Passed | Skipped |
|---|---|---|---|
| 2611 | 6 | 2605 | 5 |
View the full list of 9 ❄️ flaky test(s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig

Flake rate in main: 76.65% (Passed 127 times, Failed 417 times)

Stack Traces | 234s run time
=== RUN   TestUpdateNetworkConfig
=== PAUSE TestUpdateNetworkConfig
=== CONT  TestUpdateNetworkConfig
--- FAIL: TestUpdateNetworkConfig (233.78s)
github.com/e2b-dev/infra/tests/integration/internal/tests/api/sandboxes::TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false

Flake rate in main: 77.11% (Passed 122 times, Failed 411 times)

Stack Traces | 3.61s run time
=== RUN   TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
Executing command curl in sandbox i3iy312t4v886ik8yd1k2
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1358}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
Executing command curl in sandbox i3iy312t4v886ik8yd1k2
    sandbox_network_update_test.go:372: Command [curl] output: event:{start:{pid:1359}}
    sandbox_network_update_test.go:372: Command [curl] output: event:{end:{exit_code:35  exited:true  status:"exit status 35"  error:"exit status 35"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{start:{pid:1360}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{data:{stdout:"HTTP/2 302 \r\nx-content-type-options: nosniff\r\nlocation: https://dns.google/\r\ndate: Mon, 11 May 2026 15:24:28 GMT\r\ncontent-type: text/html; charset=UTF-8\r\nserver: HTTP server (unknown)\r\ncontent-length: 216\r\nx-xss-protection: 0\r\nx-frame-options: SAMEORIGIN\r\nalt-svc: h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000\r\n\r\n"}}
    sandbox_network_update_test.go:391: Command [curl] output: event:{end:{exited:true  status:"exit status 0"}}
    sandbox_network_update_test.go:391: Command [curl] completed successfully in sandbox i3iy312t4v886ik8yd1k2
    sandbox_network_update_test.go:391: 
        	Error Trace:	.../api/sandboxes/sandbox_network_out_test.go:74
        	            				.../api/sandboxes/sandbox_network_update_test.go:60
        	            				.../api/sandboxes/sandbox_network_update_test.go:391
        	Error:      	An error is expected but got nil.
        	Test:       	TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false
        	Messages:   	https://8.8.8.8 should be blocked
--- FAIL: TestUpdateNetworkConfig/pause_resume_preserves_allow_internet_access_false (3.61s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost

Flake rate in main: 57.64% (Passed 219 times, Failed 298 times)

Stack Traces | 0s run time
=== RUN   TestBindLocalhost
=== PAUSE TestBindLocalhost
=== CONT  TestBindLocalhost
--- FAIL: TestBindLocalhost (0.00s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_0_0_0_0

Flake rate in main: 64.02% (Passed 118 times, Failed 210 times)

Stack Traces | 7.09s run time
=== RUN   TestBindLocalhost/bind_0_0_0_0
=== PAUSE TestBindLocalhost/bind_0_0_0_0
=== CONT  TestBindLocalhost/bind_0_0_0_0
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1264}}
Executing command python in sandbox i7s34055pfqsefkyki5d5
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_0_0_0_0
        	Messages:   	Unexpected status code 502 for bind address 0.0.0.0
--- FAIL: TestBindLocalhost/bind_0_0_0_0 (7.09s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_127_0_0_1

Flake rate in main: 58.62% (Passed 120 times, Failed 170 times)

Stack Traces | 7.06s run time
=== RUN   TestBindLocalhost/bind_127_0_0_1
=== PAUSE TestBindLocalhost/bind_127_0_0_1
=== CONT  TestBindLocalhost/bind_127_0_0_1
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
Executing command python in sandbox i4tlb50akshl8v7wya1ig
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_127_0_0_1
        	Messages:   	Unexpected status code 502 for bind address 127.0.0.1
--- FAIL: TestBindLocalhost/bind_127_0_0_1 (7.06s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_::1

Flake rate in main: 65.60% (Passed 118 times, Failed 225 times)

Stack Traces | 7.59s run time
=== RUN   TestBindLocalhost/bind_::1
=== PAUSE TestBindLocalhost/bind_::1
=== CONT  TestBindLocalhost/bind_::1
Executing command python in sandbox iyji6yi23c96r2x3retvu
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
Executing command update-ca-certificates in sandbox itbwuobcfjow172dl65dq (user: root)
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_::1
        	Messages:   	Unexpected status code 502 for bind address ::1
--- FAIL: TestBindLocalhost/bind_::1 (7.59s)
github.com/e2b-dev/infra/tests/integration/internal/tests/envd::TestBindLocalhost/bind_localhost

Flake rate in main: 65.70% (Passed 118 times, Failed 226 times)

Stack Traces | 7.15s run time
=== RUN   TestBindLocalhost/bind_localhost
=== PAUSE TestBindLocalhost/bind_localhost
=== CONT  TestBindLocalhost/bind_localhost
Executing command python in sandbox i7mrt94yn5obfhmf8v1sn
    localhost_bind_test.go:69: Command [python] output: event:{start:{pid:1258}}
    localhost_bind_test.go:90: 
        	Error Trace:	.../tests/envd/localhost_bind_test.go:90
        	Error:      	Not equal: 
        	            	expected: 200
        	            	actual  : 502
        	Test:       	TestBindLocalhost/bind_localhost
        	Messages:   	Unexpected status code 502 for bind address localhost
--- FAIL: TestBindLocalhost/bind_localhost (7.15s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity

Flake rate in main: 66.40% (Passed 128 times, Failed 253 times)

Stack Traces | 76.7s run time
=== RUN   TestSandboxMemoryIntegrity
=== PAUSE TestSandboxMemoryIntegrity
=== CONT  TestSandboxMemoryIntegrity
    sandbox_memory_integrity_test.go:26: Build completed successfully
--- FAIL: TestSandboxMemoryIntegrity (76.66s)
github.com/e2b-dev/infra/tests/integration/internal/tests/orchestrator::TestSandboxMemoryIntegrity/tmpfs_hash

Flake rate in main: 67.67% (Passed 118 times, Failed 247 times)

Stack Traces | 27.8s run time
=== RUN   TestSandboxMemoryIntegrity/tmpfs_hash
=== PAUSE TestSandboxMemoryIntegrity/tmpfs_hash
=== CONT  TestSandboxMemoryIntegrity/tmpfs_hash
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{start:{pid:1264}}
Executing command bash in sandbox i3wtyuoimkhfahfluymb6 (user: root)
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Total memory: 985 MB\nUsed memory before tmpfs mount: 185 MB\nFree memory before tmpfs mount: 799 MB\nMemory to use in integrity test (80% of free, min 64MB): 639 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"639+0 records in\n639+0 records out\n670040064 bytes (670 MB, 639 MiB) copied, 3.63625 s, 184 MB/s\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stderr:"\tCommand being timed: \"dd if=/dev/urandom of=/mnt/testfile bs=1M count=639\"\n\tUser time (seconds): 0.00\n\tSystem time (seconds): 3.60\n\tPercent of CPU this job got: 99%\n\tElapsed (wall clock) time (h:mm:ss or m:ss): 0:03.64\n\tAverage shared text size (kbytes): 0\n\tAverage unshared data size (kbytes): 0\n\tAverage stack size (kbytes): 0\n\tAverage total size (kbytes): 0\n\tMaximum resident set size (kbytes): 2628\n\tAverage resident set size (kbytes): 0\n\tMajor (requiring I/O) page faults: 2\n\tMinor (reclaiming a frame) page faults: 344\n\tVoluntary context switches: 3\n\tInvoluntary context switches: 85\n\tSwaps: 0\n\tFile system inputs: 176\n\tFile system outputs: 0\n\tSocket messages sent: 0\n\tSocket messages received: 0\n\tSignals delivered: 0\n\tPage size (bytes): 4096\n\tExit status: 0\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{data:{stdout:"Used memory after tmpfs mount and file fill: 831 MB\n"}}
    sandbox_memory_integrity_test.go:70: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:70: Command [bash] completed successfully in sandbox ibsthfogwb207obup59ow
Executing command bash in sandbox ibsthfogwb207obup59ow (user: root)
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{start:{pid:1280}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{data:{stdout:"c7b16e0a3176350f7400a7e3bae64edd68fba8498c374e826488a71b6acf9663\n"}}
    sandbox_memory_integrity_test.go:74: Command [bash] output: event:{end:{exited:true status:"exit status 0"}}
    sandbox_memory_integrity_test.go:74: Command [bash] completed successfully in sandbox ibsthfogwb207obup59ow
Executing command bash in sandbox ibsthfogwb207obup59ow (user: root)
    sandbox_memory_integrity_test.go:99: Command [bash] output: event:{start:{pid:1283}}
    sandbox_memory_integrity_test.go:100: 
        	Error Trace:	.../tests/orchestrator/sandbox_memory_integrity_test.go:100
        	Error:      	Received unexpected error:
        	            	failed to execute command bash in sandbox ibsthfogwb207obup59ow: invalid_argument: protocol error: incomplete envelope: unexpected EOF
        	Test:       	TestSandboxMemoryIntegrity/tmpfs_hash
--- FAIL: TestSandboxMemoryIntegrity/tmpfs_hash (27.84s)

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

The use of singleflight.Do inside the errgroup goroutine can lead to starvation: each duplicate call blocks while it waits on the in-flight call for the same sandbox ID, so redundant goroutines can fill all 256 slots and prevent the evictor from processing other sandboxes. Replacing singleflight with a sync.Map that tracks active sandbox IDs, and skipping the TryGo call when an eviction is already in progress, avoids this blocking and keeps slots available for other sandboxes.

Comment thread packages/api/internal/orchestrator/evictor/evict.go
Comment thread packages/api/internal/orchestrator/evictor/evict.go Outdated
Comment thread packages/api/internal/orchestrator/evictor/evict.go
…arvation

singleflight.Do blocked inside the errgroup goroutine, so duplicate
ticks for the same slow-evicting sandbox piled up as waiters and could
exhaust the 256 concurrency cap. Switch to a sync.Map LoadOrStore check
before TryGo so duplicates short-circuit without consuming a slot, and
derive the in-flight gauge from the map directly.
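The check-before-schedule order described in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code in evict.go: the `evictor` type, `tryEvict`, and the `evict` callback are hypothetical, and a buffered channel stands in for `errgroup.TryGo` so the example runs with the standard library only. The cap is 2 instead of 256.

```go
package main

import (
	"fmt"
	"sync"
)

// evictor is a hypothetical, simplified stand-in for the real evictor.
type evictor struct {
	active sync.Map      // sandbox ID -> struct{}; present while an eviction runs
	slots  chan struct{} // non-blocking semaphore standing in for errgroup.TryGo
	wg     sync.WaitGroup
}

func newEvictor(limit int) *evictor {
	return &evictor{slots: make(chan struct{}, limit)}
}

// tryEvict reports whether a new eviction was started for sandboxID.
func (e *evictor) tryEvict(sandboxID string, evict func(string)) bool {
	// Duplicate check first: a sandbox already being evicted is skipped
	// before any slot is taken, so duplicates cannot exhaust the cap.
	if _, loaded := e.active.LoadOrStore(sandboxID, struct{}{}); loaded {
		return false
	}
	select {
	case e.slots <- struct{}{}:
	default:
		// Cap reached: undo the marker so a later tick can retry this sandbox.
		e.active.Delete(sandboxID)
		return false
	}
	e.wg.Add(1)
	go func() {
		defer e.wg.Done()
		defer func() { <-e.slots }()
		defer e.active.Delete(sandboxID) // runs first: clear the in-flight marker
		evict(sandboxID)
	}()
	return true
}

// wait blocks until all started evictions have finished.
func (e *evictor) wait() { e.wg.Wait() }

func main() {
	e := newEvictor(2)
	release := make(chan struct{})
	slow := func(string) { <-release } // simulates a slow eviction

	fmt.Println(e.tryEvict("sbx-a", slow)) // true: started
	fmt.Println(e.tryEvict("sbx-a", slow)) // false: duplicate, no slot consumed
	fmt.Println(e.tryEvict("sbx-b", slow)) // true: second slot still free
	fmt.Println(e.tryEvict("sbx-c", slow)) // false: cap of 2 reached

	close(release)
	e.wait()
	fmt.Println(e.tryEvict("sbx-c", slow)) // true: retried on a later "tick"
	e.wait()
}
```

The second call for `sbx-a` returns before touching the semaphore, which is the property the commit relies on: a slow eviction holds exactly one slot no matter how many ticks observe the same sandbox.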
@jakubno jakubno changed the title from "feat(api/evictor): singleflight per sandbox, 256-concurrency cap, in-flight gauge" to "feat(api): singleflight per sandbox, 256-concurrency cap, in-flight gauge" May 11, 2026
@jakubno jakubno marked this pull request as ready for review May 11, 2026 15:19

@claude claude Bot left a comment

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, reopen this pull request to trigger a review.

@jakubno jakubno merged commit f2c360d into main May 12, 2026
52 checks passed
@jakubno jakubno deleted the evictor-singleflight-limit branch May 12, 2026 16:18
ValentaTomas pushed a commit that referenced this pull request May 13, 2026
…auge (#2625)

Dedupe overlapping eviction attempts for the same sandbox via
`singleflight`, cap concurrent evictions at **256** using
`errgroup.TryGo` so the ticker loop never blocks, and expose an
observable gauge (api.evictor.evictions.running) for the in-flight
count.
AdaAibaby pushed a commit to AdaAibaby/infra that referenced this pull request May 14, 2026
…auge (e2b-dev#2625)

Dedupe overlapping eviction attempts for the same sandbox via
`singleflight`, cap concurrent evictions at **256** using
`errgroup.TryGo` so the ticker loop never blocks, and expose an
observable gauge (api.evictor.evictions.running) for the in-flight
count.
3 participants