Draft
119 commits
2a4fd7b
proxy: use token-authenticated grpc resume
matthewlouisbrockman Apr 20, 2026
d40a473
iac: expose api grpc resume route
matthewlouisbrockman Apr 20, 2026
75fe000
tests: cover proxy without autoresume
matthewlouisbrockman Apr 20, 2026
5410a78
proxy: resolve resumed sandbox from catalog
matthewlouisbrockman Apr 20, 2026
a43231b
note on edge proxies needing to re-retrieve from the catalog
matthewlouisbrockman Apr 20, 2026
ff454b7
Add traffic keepalive lifecycle config
matthewlouisbrockman Apr 21, 2026
bddd713
Propagate traffic keepalive through catalog
matthewlouisbrockman Apr 21, 2026
7f466bd
Refresh sandboxes from proxy traffic
matthewlouisbrockman Apr 21, 2026
28acdbd
don't need new grpc token, can use the existing api token
matthewlouisbrockman Apr 21, 2026
a544be8
use apiSecret naming to finish consolidation from separate resume aut…
matthewlouisbrockman Apr 21, 2026
743720b
Allow edge cluster token for proxy resume
matthewlouisbrockman Apr 21, 2026
3295a1d
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman Apr 21, 2026
9f4f0de
Rate limit API gRPC resume calls
matthewlouisbrockman Apr 21, 2026
abaf347
Fix AWS client-proxy gRPC resume wiring
matthewlouisbrockman Apr 21, 2026
9c74628
Apply lint fixes
matthewlouisbrockman Apr 21, 2026
c5d07e0
Retry catalog lookup after auto-resume
matthewlouisbrockman Apr 21, 2026
2ad0816
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman Apr 21, 2026
d8afeb3
Fix client proxy lint
matthewlouisbrockman Apr 21, 2026
57e75a6
Persist snapshot internet access updates
matthewlouisbrockman Apr 21, 2026
4e92684
Retry resumed sandbox proxy connections
matthewlouisbrockman Apr 21, 2026
0ab385e
Handle empty auto-resume routes explicitly
matthewlouisbrockman Apr 22, 2026
a4d9da2
edge returns the ip on autoresume to avoid extra check
matthewlouisbrockman Apr 22, 2026
f3f6d10
Consolidate local route IP handling
matthewlouisbrockman Apr 22, 2026
efa2672
refactor(api): simplify route IP lookup
matthewlouisbrockman Apr 22, 2026
052c078
refactor(orchestrator): reuse traffic token header constant
matthewlouisbrockman Apr 22, 2026
05db929
chore: remove redundant autoresume comments
matthewlouisbrockman Apr 22, 2026
0cdff28
refactor(client-proxy): drop resume catalog fallback
matthewlouisbrockman Apr 22, 2026
7a93b62
use ConstantTimeCompare on secret comparison
matthewlouisbrockman Apr 22, 2026
c51de22
keepalive only via redis
matthewlouisbrockman Apr 23, 2026
de46ab0
feat: make keepalive an object in lifecycle
matthewlouisbrockman Apr 23, 2026
0afb767
fix: make lifecycle keepalive optional
matthewlouisbrockman Apr 23, 2026
15fbf58
not need token internally, only use the cluster token, but looking at…
matthewlouisbrockman Apr 23, 2026
d5b299f
only pass around the cluster auth tokens for the byoc autoresume setup
matthewlouisbrockman Apr 23, 2026
8b658e9
move public grpc to another port to check auth before joining flow
matthewlouisbrockman Apr 23, 2026
c22a1e2
fix client proxy config tag alignment
matthewlouisbrockman Apr 23, 2026
289472d
use oauth org auth for client proxy resume
matthewlouisbrockman Apr 24, 2026
34bc0c2
remove unused client proxy cluster token
matthewlouisbrockman Apr 24, 2026
e9d080b
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman Apr 24, 2026
ffea1d9
avoid public grpc port collision in ci
matthewlouisbrockman Apr 24, 2026
5b36fff
use ci public grpc port 5013
matthewlouisbrockman Apr 24, 2026
8c50ea8
route client proxy autoresume through public grpc auth
matthewlouisbrockman Apr 24, 2026
46724f3
move public api grpc port to 5109
matthewlouisbrockman Apr 24, 2026
0148f8e
revert client proxy public grpc wiring, keep it just for edge
matthewlouisbrockman Apr 24, 2026
fb0998b
remove unused api secret from api job
matthewlouisbrockman Apr 24, 2026
dc5d958
infer client proxy grpc tls from address
matthewlouisbrockman Apr 24, 2026
f0cf3c9
make cluster auth org index nonunique
matthewlouisbrockman Apr 24, 2026
37f54c3
remove resumed sandbox proxy retry override
matthewlouisbrockman Apr 24, 2026
16fe8ad
remove proxy destination retry override
matthewlouisbrockman Apr 24, 2026
feaf9c7
feat(proxy): add catalog-backed traffic keepalive throttle
matthewlouisbrockman Apr 24, 2026
c3f08c7
Merge remote-tracking branch 'origin/main' into feat/traffic-keepalive
matthewlouisbrockman Apr 24, 2026
104a8bb
Route public API gRPC through ingress
matthewlouisbrockman Apr 25, 2026
becd10c
Require bearer metadata before public gRPC lookup
matthewlouisbrockman Apr 25, 2026
60df4df
Infer client proxy API gRPC TLS
matthewlouisbrockman Apr 25, 2026
8150071
Keep local API gRPC plaintext
matthewlouisbrockman Apr 25, 2026
20bbf8b
Remove stale API gRPC named port
matthewlouisbrockman Apr 25, 2026
a849729
Remove unused API cluster token
matthewlouisbrockman Apr 25, 2026
b4e4189
Document autoresume state handling
matthewlouisbrockman Apr 25, 2026
419ae95
Keep autoresume routing info wording
matthewlouisbrockman Apr 25, 2026
435419e
Use noop API gRPC OAuth token source
matthewlouisbrockman Apr 25, 2026
253b461
Remove unrelated snapshot upsert change
matthewlouisbrockman Apr 25, 2026
0b7eebb
Fix autoresume routing error assertion
matthewlouisbrockman Apr 25, 2026
761e74e
Use noop client proxy OIDC verifier
matthewlouisbrockman Apr 25, 2026
1f76ea4
Reuse discovery local IP for route host
matthewlouisbrockman Apr 27, 2026
4548f24
Rename API gRPC ports to internal and edge
matthewlouisbrockman Apr 27, 2026
a0014f2
Rename cluster instance IP to local IP
matthewlouisbrockman Apr 27, 2026
ff35f10
Route edge API gRPC through ingress LB
matthewlouisbrockman Apr 27, 2026
371feaf
Remove legacy edge resume auth metadata
matthewlouisbrockman Apr 28, 2026
b40b179
Move client proxy OAuth verifier out of handlers
matthewlouisbrockman Apr 28, 2026
f0972bc
Move OAuth bearer checks into internal oauth
matthewlouisbrockman Apr 28, 2026
41017d6
Restore proxy traffic access token header
matthewlouisbrockman Apr 28, 2026
da734bd
create func normalizeNodeIP
matthewlouisbrockman Apr 28, 2026
af9208f
split gRPC resume oauth auth modes
matthewlouisbrockman Apr 28, 2026
e320d4a
separate auth from autoresume logic
matthewlouisbrockman Apr 28, 2026
bb52622
Check edge traffic JWT before database access
matthewlouisbrockman Apr 28, 2026
081c0bc
Clean up client proxy OAuth helpers
matthewlouisbrockman Apr 29, 2026
8c3bfa0
Remove unused API gRPC port override
matthewlouisbrockman Apr 29, 2026
5581a43
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman Apr 29, 2026
161fb72
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman Apr 29, 2026
842dfd3
consolidate naming from internal edge on grpc for autoresume
matthewlouisbrockman Apr 29, 2026
b0256e9
auth store tracks team ids
matthewlouisbrockman Apr 29, 2026
7f2b446
auth org id is unique in clusters
matthewlouisbrockman Apr 29, 2026
4a4b4fd
Require autoresume scope for edge resume
matthewlouisbrockman Apr 29, 2026
4529407
rm newline
matthewlouisbrockman Apr 29, 2026
7317c40
note on not having the node ip for cicd
matthewlouisbrockman Apr 29, 2026
93373eb
add autoresume env vars to the .env.gcp.template
matthewlouisbrockman Apr 29, 2026
1bf35c1
Use lifecycle scope for edge autoresume
matthewlouisbrockman Apr 29, 2026
41fe105
refactor(api): dedupe catalog keepalive conversion
matthewlouisbrockman Apr 29, 2026
45a0ede
refactor(proxy): rename sandbox lifecycle client
matthewlouisbrockman Apr 29, 2026
272b5b2
rename api-edge-grpc to grpc-api to keep it consistant with dashboard…
matthewlouisbrockman Apr 30, 2026
cf8744d
use shared const for the required jwt scopes for autoresume over grpc
matthewlouisbrockman Apr 30, 2026
03f6bbd
Merge remote-tracking branch 'origin/main' into feat/byoc-autoresume
matthewlouisbrockman May 1, 2026
9d720b1
Use TLS for external client proxy gRPC
matthewlouisbrockman May 1, 2026
2f99e54
Pass startup context to grpc resume auth
matthewlouisbrockman May 1, 2026
1c45a5d
Clean up grpc resume auth assignment
matthewlouisbrockman May 1, 2026
574b90f
Allow client proxy OIDC without audience
matthewlouisbrockman May 1, 2026
72f09a3
Remove client proxy OIDC audience config
matthewlouisbrockman May 1, 2026
e98abb7
Merge remote-tracking branch 'origin/main' into feat/traffic-keepalive
matthewlouisbrockman May 1, 2026
38e5411
Use catalog traffic keepalive timer
matthewlouisbrockman May 5, 2026
c30c6e0
Clarify traffic keepalive eligibility
matthewlouisbrockman May 5, 2026
c840405
Merge branch 'main' into feat/traffic-keepalive
matthewlouisbrockman May 5, 2026
e9ab050
Merge branch 'feat/byoc-autoresume' into feat/traffic-keepalive
matthewlouisbrockman May 5, 2026
62180b5
Store sandbox routing metadata from edge creates
matthewlouisbrockman May 5, 2026
66bcdba
Persist keepalive policy in orchestrator config
matthewlouisbrockman May 5, 2026
2666e30
Harden traffic keepalive API validation
matthewlouisbrockman May 5, 2026
d4f5b04
Remove unused sandbox timeout persistence
matthewlouisbrockman May 5, 2026
7b863e1
Clean up traffic keepalive review nits
matthewlouisbrockman May 5, 2026
61d88c9
fix(proxy): validate traffic keepalive requests
matthewlouisbrockman May 6, 2026
de2b6c8
chore(api): address keepalive review notes
matthewlouisbrockman May 6, 2026
dcc5a1b
Separate lifecycle object from flat sandbox fields
matthewlouisbrockman May 6, 2026
21be331
chore: auto-commit generated changes
github-actions[bot] May 7, 2026
dc2ab30
Clarify sandbox catalog alias
matthewlouisbrockman May 7, 2026
3b3de0d
Clarify sandbox routing catalog alias
matthewlouisbrockman May 7, 2026
9b4be16
Remove orchestrator variables churn
matthewlouisbrockman May 7, 2026
caa8cca
Apply traffic keepalive lint fixes
matthewlouisbrockman May 7, 2026
cf99daa
Require proxy auth metadata tokens
matthewlouisbrockman May 7, 2026
4868193
Rename client proxy control plane inputs
matthewlouisbrockman May 7, 2026
fe2bbe2
Merge remote-tracking branch 'origin/main' into feat/traffic-keepalive
matthewlouisbrockman May 7, 2026
b24dd2d
Use lifecycle config for sandbox timeout policies
matthewlouisbrockman May 7, 2026
088532b
Revert "Rename client proxy control plane inputs"
matthewlouisbrockman May 8, 2026
373 changes: 205 additions & 168 deletions packages/api/internal/api/api.gen.go

Large diffs are not rendered by default.

204 changes: 134 additions & 70 deletions packages/api/internal/handlers/proxy_grpc.go
@@ -8,6 +8,7 @@ import (
"strconv"
"time"

"github.com/google/uuid"
"go.uber.org/zap"
"google.golang.org/grpc/codes"
"google.golang.org/grpc/metadata"
@@ -18,6 +19,7 @@ import (
apiorchestrator "github.com/e2b-dev/infra/packages/api/internal/orchestrator"
"github.com/e2b-dev/infra/packages/api/internal/sandbox"
"github.com/e2b-dev/infra/packages/api/internal/utils"
typesteam "github.com/e2b-dev/infra/packages/auth/pkg/types"
dbtypes "github.com/e2b-dev/infra/packages/db/pkg/types"
"github.com/e2b-dev/infra/packages/shared/pkg/consts"
"github.com/e2b-dev/infra/packages/shared/pkg/featureflags"
@@ -69,7 +71,7 @@ func isNonEnvdTrafficRequest(ctx context.Context, incomingMetadata metadata.MD,
if parseErr != nil {
logger.L().Warn(
ctx,
"invalid sandbox request port metadata for resume",
"invalid sandbox request port metadata for proxy traffic",
zap.Error(parseErr),
zap.String("request_port", requestPortRaw),
logger.WithSandboxID(sandboxID),
@@ -107,7 +109,9 @@ func (s *SandboxService) getAutoResumeSnapshot(ctx context.Context, sandboxID st

var autoResume *dbtypes.SandboxAutoResumeConfig
if snap.Snapshot.Config != nil {
autoResume = snap.Snapshot.Config.AutoResume
if lifecycle := snap.Snapshot.Config.LifecycleConfig(); lifecycle != nil {
autoResume = lifecycle.AutoResume
}
}
if autoResume == nil || autoResume.Policy != dbtypes.SandboxAutoResumeAny {
return nil, nil, status.Error(codes.NotFound, "sandbox auto-resume disabled")
@@ -116,21 +120,67 @@ func (s *SandboxService) getAutoResumeSnapshot(ctx context.Context, sandboxID st
return snap, autoResume, nil
}

func (s *SandboxService) ResumeSandbox(ctx context.Context, req *proxygrpc.SandboxResumeRequest) (*proxygrpc.SandboxResumeResponse, error) {
func (s *SandboxService) validateSandboxTraffic(ctx context.Context, sandboxID string, network *dbtypes.SandboxNetworkConfig, envdAccessToken *string) error {
incomingMetadata := metadataFromIncomingContext(ctx)
isNonEnvdTraffic := isNonEnvdTrafficRequest(ctx, incomingMetadata, sandboxID)

// Validate traffic access token for sandboxes with private ingress.
if isPrivateIngressTraffic(network) && isNonEnvdTraffic {
expectedToken, tokenErr := s.api.accessTokenGenerator.GenerateTrafficAccessToken(sandboxID)
if tokenErr != nil {
logger.L().Error(ctx, "failed to generate expected traffic access token", zap.Error(tokenErr), logger.WithSandboxID(sandboxID))

return status.Error(codes.Internal, "failed to validate traffic access token")
}

providedToken, found := metadataFirstValue(incomingMetadata, proxygrpc.MetadataTrafficAccessToken)

if !found || !tokensMatch(providedToken, expectedToken) {
return denyResumePermission()
}
}

// Callers pass envdAccessToken only when envd traffic must enforce it.
if !isNonEnvdTraffic && envdAccessToken != nil {
providedEnvdToken, found := metadataFirstValue(incomingMetadata, proxygrpc.MetadataEnvdAccessToken)

var clientProxyClaims oauth.Claims
if s.requireEdgeClientProxyAuth {
var authErr error
clientProxyClaims, authErr = oauth.RequireClaims(ctx, incomingMetadata, s.clientProxyOAuth)
if authErr != nil {
return nil, authErr
if !found || !tokensMatch(providedEnvdToken, *envdAccessToken) {
return denyResumePermission()
}
if err := oauth.RequireScopeClaims(clientProxyClaims, oauth.RequiredScope); err != nil {
return nil, err
}

return nil
}

func (s *SandboxService) requireClientProxyAuth(ctx context.Context, incomingMetadata metadata.MD, team *typesteam.Team) error {
if !s.requireEdgeClientProxyAuth {
return nil
}

clientProxyClaims, err := oauth.RequireClaims(ctx, incomingMetadata, s.clientProxyOAuth)
if err != nil {
return err
}
if err := oauth.RequireScopeClaims(clientProxyClaims, oauth.RequiredScope); err != nil {
return err
}

var authOrgID string
if team.ClusterID != nil {
cluster, found := s.api.clusters.GetClusterById(*team.ClusterID)
if !found {
return status.Errorf(codes.Internal, "cluster with ID '%s' not found", *team.ClusterID)
}

authOrgID = cluster.AuthOrgID
}

return oauth.RequireOrgClaims(clientProxyClaims, authOrgID)
}

func (s *SandboxService) ResumeSandbox(ctx context.Context, req *proxygrpc.SandboxResumeRequest) (*proxygrpc.SandboxResumeResponse, error) {
incomingMetadata := metadataFromIncomingContext(ctx)

sandboxID, err := utils.ShortID(req.GetSandboxId())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "invalid sandbox ID")
@@ -148,20 +198,33 @@ func (s *SandboxService) ResumeSandbox(ctx context.Context, req *proxygrpc.Sandb
return nil, status.Errorf(codes.Internal, "failed to get team: %v", err)
}

if s.requireEdgeClientProxyAuth {
var authOrgID string
if team.ClusterID != nil {
cluster, found := s.api.clusters.GetClusterById(*team.ClusterID)
if !found {
return nil, status.Errorf(codes.Internal, "cluster with ID '%s' not found", *team.ClusterID)
}
if err := s.requireClientProxyAuth(ctx, incomingMetadata, team); err != nil {
return nil, err
}

authOrgID = cluster.AuthOrgID
}
minAutoResumeTimeout := time.Duration(s.api.featureFlags.IntFlag(ctx, featureflags.MinAutoResumeTimeoutSeconds)) * time.Second

timeout := calculateAutoResumeTimeout(autoResume, minAutoResumeTimeout, team)

var envdAccessToken *string
if snap.Snapshot.EnvSecure {
accessToken, tokenErr := s.api.getEnvdAccessToken(snap.EnvBuild.EnvdVersion, sandboxID)
if tokenErr != nil {
logger.L().Error(ctx, "Secure envd access token error", zap.Error(tokenErr.Err), logger.WithSandboxID(sandboxID))

if err := oauth.RequireOrgClaims(clientProxyClaims, authOrgID); err != nil {
return nil, err
return nil, status.Error(codes.Internal, "failed to create envd access token")
}

envdAccessToken = &accessToken
}

var network *dbtypes.SandboxNetworkConfig
if snap.Snapshot.Config != nil {
network = snap.Snapshot.Config.Network
}

if trafficErr := s.validateSandboxTraffic(ctx, sandboxID, network, envdAccessToken); trafficErr != nil {
return nil, trafficErr
}

sandboxData, sandboxErr := s.api.orchestrator.GetSandbox(ctx, teamID, sandboxID)
@@ -195,54 +258,6 @@ func (s *SandboxService) ResumeSandbox(ctx context.Context, req *proxygrpc.Sandb
}
}

minAutoResumeTimeout := time.Duration(s.api.featureFlags.IntFlag(ctx, featureflags.MinAutoResumeTimeoutSeconds)) * time.Second

timeout := calculateAutoResumeTimeout(autoResume, minAutoResumeTimeout, team)

var envdAccessToken *string
if snap.Snapshot.EnvSecure {
accessToken, tokenErr := s.api.getEnvdAccessToken(snap.EnvBuild.EnvdVersion, sandboxID)
if tokenErr != nil {
logger.L().Error(ctx, "Secure envd access token error", zap.Error(tokenErr.Err), logger.WithSandboxID(sandboxID))

return nil, status.Error(codes.Internal, "failed to create envd access token")
}

envdAccessToken = &accessToken
}

var network *dbtypes.SandboxNetworkConfig
if snap.Snapshot.Config != nil {
network = snap.Snapshot.Config.Network
}

isNonEnvdTraffic := isNonEnvdTrafficRequest(ctx, incomingMetadata, sandboxID)

// Validate traffic access token for sandboxes with private ingress.
if isPrivateIngressTraffic(network) && isNonEnvdTraffic {
expectedToken, tokenErr := s.api.accessTokenGenerator.GenerateTrafficAccessToken(sandboxID)
if tokenErr != nil {
logger.L().Error(ctx, "failed to generate expected traffic access token", zap.Error(tokenErr), logger.WithSandboxID(sandboxID))

return nil, status.Error(codes.Internal, "failed to validate traffic access token")
}

providedToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataTrafficAccessToken)

if !tokensMatch(providedToken, expectedToken) {
return nil, denyResumePermission()
}
}

// Validate envd access token for secure sandboxes on envd traffic
if !isNonEnvdTraffic && snap.Snapshot.EnvSecure && envdAccessToken != nil {
providedEnvdToken, _ := metadataFirstValue(incomingMetadata, proxygrpc.MetadataEnvdAccessToken)

if !tokensMatch(providedEnvdToken, *envdAccessToken) {
return nil, denyResumePermission()
}
}

headers := http.Header{}
sbx, apiErr := s.api.startSandboxInternal(
ctx,
@@ -265,3 +280,52 @@ func (s *SandboxService) ResumeSandbox(ctx context.Context, req *proxygrpc.Sandb

return &proxygrpc.SandboxResumeResponse{OrchestratorIp: nodeIP}, nil
}

func (s *SandboxService) KeepAliveSandbox(ctx context.Context, req *proxygrpc.SandboxKeepAliveRequest) (*proxygrpc.SandboxKeepAliveResponse, error) {
incomingMetadata := metadataFromIncomingContext(ctx)

sandboxID, err := utils.ShortID(req.GetSandboxId())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "invalid sandbox ID")
}

teamID, err := uuid.Parse(req.GetTeamId())
if err != nil {
return nil, status.Error(codes.InvalidArgument, "invalid team ID")
}

team, err := s.api.authService.GetTeamByID(ctx, teamID)
if err != nil {
return nil, status.Errorf(codes.Internal, "failed to get team: %v", err)
}

if err := s.requireClientProxyAuth(ctx, incomingMetadata, team); err != nil {
return nil, err
}

sandboxData, err := s.api.orchestrator.GetSandbox(ctx, teamID, sandboxID)
if err != nil {
if errors.Is(err, sandbox.ErrNotFound) {
return nil, status.Error(codes.NotFound, "sandbox not found")
}

return nil, status.Errorf(codes.Internal, "failed to get sandbox state: %v", err)
}

trafficKeepalive := sandboxData.TrafficKeepalive()
if trafficKeepalive == nil {
return nil, status.Error(codes.FailedPrecondition, "sandbox traffic keepalive disabled")
}

if trafficErr := s.validateSandboxTraffic(ctx, sandboxID, sandboxData.Network, sandboxData.EnvdAccessToken); trafficErr != nil {
return nil, trafficErr
}

timeout := time.Duration(trafficKeepalive.Timeout) * time.Second

if _, apiErr := s.api.orchestrator.KeepAliveFor(ctx, teamID, sandboxID, timeout, false); apiErr != nil {
return nil, status.Error(sharedutils.GRPCCodeFromHTTPStatus(apiErr.Code), apiErr.ClientMsg)
}

return &proxygrpc.SandboxKeepAliveResponse{}, nil
}
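
The `found` check added to `validateSandboxTraffic` matters because a missing metadata value and an empty one would otherwise compare equal. The sketch below illustrates that, assuming `tokensMatch` wraps `subtle.ConstantTimeCompare` (suggested by commit 7a93b62 but not shown in this diff); `firstValue` and `allowTraffic` are hypothetical stand-ins, not the repository's helpers.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// firstValue is a hypothetical stand-in for a metadata lookup: it returns the
// first value stored under key and reports whether the key was present.
func firstValue(md map[string][]string, key string) (string, bool) {
	vals := md[key]
	if len(vals) == 0 {
		return "", false
	}
	return vals[0], true
}

// tokensMatch compares two tokens in constant time. This is an assumed
// implementation; the real helper is not part of the diff.
func tokensMatch(provided, expected string) bool {
	return subtle.ConstantTimeCompare([]byte(provided), []byte(expected)) == 1
}

// allowTraffic denies the request when the token is absent, even if the
// expected token happens to be empty.
func allowTraffic(md map[string][]string, key, expected string) bool {
	provided, found := firstValue(md, key)
	return found && tokensMatch(provided, expected)
}

func main() {
	md := map[string][]string{"token": {"secret"}}
	fmt.Println(allowTraffic(md, "token", "secret"))
	fmt.Println(allowTraffic(md, "token", "other"))
	fmt.Println(allowTraffic(map[string][]string{}, "token", ""))
}
```

Without the `found` guard, the last call would be allowed, since two empty strings compare equal.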
77 changes: 72 additions & 5 deletions packages/api/internal/handlers/sandbox_create.go
@@ -35,6 +35,7 @@ import (
"github.com/e2b-dev/infra/packages/shared/pkg/id"
sbxlogger "github.com/e2b-dev/infra/packages/shared/pkg/logger/sandbox"
"github.com/e2b-dev/infra/packages/shared/pkg/middleware/otel/metrics"
sandboxroutingcatalog "github.com/e2b-dev/infra/packages/shared/pkg/sandbox-catalog"
sandbox_network "github.com/e2b-dev/infra/packages/shared/pkg/sandbox-network"
"github.com/e2b-dev/infra/packages/shared/pkg/telemetry"
sharedUtils "github.com/e2b-dev/infra/packages/shared/pkg/utils"
@@ -141,7 +142,16 @@ func (a *APIStore) PostSandboxes(c *gin.Context) {
telemetry.WithFirecrackerVersion(build.FirecrackerVersion),
)

if lifecycleErr := validateLifecycleAliases(body); lifecycleErr != nil {
a.sendAPIStoreError(c, lifecycleErr.Code, lifecycleErr.ClientMsg)

return
}

autoPause := sharedUtils.DerefOrDefault(body.AutoPause, sandbox.AutoPauseDefault)
if body.Lifecycle != nil && body.Lifecycle.OnTimeout != nil {
autoPause = *body.Lifecycle.OnTimeout == api.Pause
}
envVars := sharedUtils.DerefOrDefault(body.EnvVars, nil)
mcp := sharedUtils.DerefOrDefault(body.Mcp, nil)
metadata := sharedUtils.DerefOrDefault(body.Metadata, nil)
@@ -159,10 +169,19 @@ func (a *APIStore) PostSandboxes(c *gin.Context) {
}

autoResume := buildAutoResumeConfig(body.AutoResume)
if body.Lifecycle != nil && body.Lifecycle.AutoResume != nil {
autoResume = buildAutoResumeConfigFromEnabled(*body.Lifecycle.AutoResume)
}
if autoResume != nil {
minAutoResumeTimeout := time.Duration(a.featureFlags.IntFlag(ctx, featureflags.MinAutoResumeTimeoutSeconds)) * time.Second
autoResume.Timeout = calculateTimeoutSeconds(timeout, minAutoResumeTimeout, teamInfo)
}
keepalive, keepaliveErr := buildKeepaliveConfig(body.Lifecycle)
if keepaliveErr != nil {
a.sendAPIStoreError(c, keepaliveErr.Code, keepaliveErr.ClientMsg)

return
}

var envdAccessToken *string = nil
if body.Secure != nil && *body.Secure == true {
@@ -250,10 +269,13 @@ func (a *APIStore) PostSandboxes(c *gin.Context) {
Alias: alias,
TemplateID: env.TemplateID,
BaseTemplateID: env.TemplateID,
AutoPause: autoPause,
AutoResume: autoResume,
VolumeMounts: sbxVolumeMounts,
EnvdAccessToken: envdAccessToken,
Lifecycle: types.SandboxLifecycleConfig{
AutoPause: autoPause,
AutoResume: autoResume,
Keepalive: keepalive,
},
VolumeMounts: sbxVolumeMounts,
EnvdAccessToken: envdAccessToken,
}, nil
}

@@ -294,8 +316,12 @@ func buildAutoResumeConfig(autoResume *api.SandboxAutoResumeConfig) *types.Sandb
return nil
}

return buildAutoResumeConfigFromEnabled(autoResume.Enabled)
}

func buildAutoResumeConfigFromEnabled(enabled bool) *types.SandboxAutoResumeConfig {
policy := types.SandboxAutoResumeOff
if autoResume.Enabled {
if enabled {
policy = types.SandboxAutoResumeAny
}

@@ -304,6 +330,47 @@ func buildAutoResumeConfig(autoResume *api.SandboxAutoResumeConfig) *types.Sandb
}
}

func validateLifecycleAliases(body api.NewSandbox) *api.APIError {
if body.Lifecycle == nil {
return nil
}

if body.AutoPause != nil && body.Lifecycle.OnTimeout != nil {
return &api.APIError{Code: http.StatusBadRequest, ClientMsg: "autoPause and lifecycle.onTimeout cannot both be set"}
}

if body.AutoResume != nil && body.Lifecycle.AutoResume != nil {
return &api.APIError{Code: http.StatusBadRequest, ClientMsg: "autoResume and lifecycle.autoResume cannot both be set"}
}

return nil
}

func buildKeepaliveConfig(lifecycle *api.NewSandboxLifecycle) (*types.SandboxKeepaliveConfig, *api.APIError) {
if lifecycle == nil || lifecycle.Keepalive == nil || lifecycle.Keepalive.Traffic == nil {
return nil, nil
}

timeout := types.SandboxTrafficKeepaliveTimeoutDefault
if lifecycle.Keepalive.Traffic.Timeout != nil {
if *lifecycle.Keepalive.Traffic.Timeout < 0 {
return nil, &api.APIError{Code: http.StatusBadRequest, ClientMsg: "Traffic keepalive timeout cannot be negative"}
}
if time.Duration(*lifecycle.Keepalive.Traffic.Timeout)*time.Second <= sandboxroutingcatalog.TrafficKeepaliveThrottleInterval {
return nil, &api.APIError{Code: http.StatusBadRequest, ClientMsg: fmt.Sprintf("Traffic keepalive timeout must be greater than %d seconds", int(sandboxroutingcatalog.TrafficKeepaliveThrottleInterval.Seconds()))}
}

timeout = uint64(*lifecycle.Keepalive.Traffic.Timeout)
}
Keepalive timeout stored without upper-bound validation

Low Severity

buildKeepaliveConfig validates that the traffic keepalive timeout is non-negative and greater than the throttle interval, but never validates an upper bound. A caller could supply an arbitrarily large timeout (up to int32 max, ~68 years), which would be stored and later used verbatim in KeepAliveFor. While getMaxAllowedTTL caps the actual extension at MaxInstanceLength, the uncapped stored value is misleading. A zero timeout does not slip through: assuming TrafficKeepaliveThrottleInterval is positive, 0 <= TrafficKeepaliveThrottleInterval holds, so zero is rejected with the "must be greater than N seconds" message rather than producing a no-op keepalive.
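
A bounded check along the lines this comment asks for could look like the following minimal sketch. The constants `throttleInterval` and `maxKeepalive` are hypothetical stand-ins for `TrafficKeepaliveThrottleInterval` and an upper limit such as `MaxInstanceLength`; their real values are not shown in this diff, and `validateKeepaliveTimeout` is an illustrative name, not a function from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical values; the real constants live in the repository and may differ.
const (
	throttleInterval = 30 * time.Second // stand-in for TrafficKeepaliveThrottleInterval
	maxKeepalive     = 24 * time.Hour   // stand-in for an upper bound such as MaxInstanceLength
)

var errOutOfRange = errors.New("traffic keepalive timeout out of range")

// validateKeepaliveTimeout accepts timeouts strictly greater than the throttle
// interval and no larger than maxKeepalive. Negative and zero values fall
// below the lower bound and are rejected by the same comparison.
func validateKeepaliveTimeout(seconds int32) (time.Duration, error) {
	d := time.Duration(seconds) * time.Second
	if d <= throttleInterval || d > maxKeepalive {
		return 0, fmt.Errorf("%w: got %ds, want more than %ds and at most %ds",
			errOutOfRange, seconds,
			int(throttleInterval.Seconds()), int(maxKeepalive.Seconds()))
	}
	return d, nil
}

func main() {
	for _, s := range []int32{-1, 0, 30, 31, 86400, 86401} {
		_, err := validateKeepaliveTimeout(s)
		fmt.Printf("%d -> accepted=%t\n", s, err == nil)
	}
}
```

Rejecting oversized values at creation time keeps the stored timeout equal to what `KeepAliveFor` can actually apply, instead of relying on a later cap.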


Reviewed by Cursor Bugbot for commit b24dd2d.


return &types.SandboxKeepaliveConfig{
Traffic: &types.SandboxTrafficKeepaliveConfig{
Enabled: lifecycle.Keepalive.Traffic.Enabled,
Timeout: timeout,
},
}, nil
}

func dedupeVolumeNames(items []api.SandboxVolumeMount) []string {
itemsSet := make(map[string]struct{}, len(items))
for _, item := range items {