Cluster total capacity 9*500=4500 but reject create after only ~1200 sandboxes with "too many sandboxes starting on this node" #2893

Description

@AdaAibaby

Problem Summary
We have a cluster of 9 nodes, each configured with a max sandbox limit of 500, so the theoretical cluster capacity is 9 × 500 = 4500 sandboxes.
However, sandbox creation starts failing at only ~1200 running sandboxes: gRPC returns ResourceExhausted: too many sandboxes starting on this node, please retry.
No single node has reached its individual quota of 500, so the orchestrator appears to throttle allocation across the cluster too early.
Error Log
SandboxService/Create/unary [ResourceExhausted]: finished call {"service": "orchestrator_template-manager", "internal": true, "pid": 936536, "protocol": "grpc", "grpc.component": "server", "grpc.service": "SandboxService", "grpc.method": "Create", "grpc.method_type": "unary", "peer.address": "10.254.72.8:50292", "grpc.start_time": "2026-06-02T10:59:52+08:00", "grpc.request.deadline": "2026-06-02T11:01:02+08:00", "grpc.code": "ResourceExhausted", "grpc.error": "rpc error: code = ResourceExhausted desc = too many sandboxes starting on this node, please retry", "grpc.time_ms": "0.536", "trace_id": "c08e21fe0cc8487da4b3f6421aca931f", "span_id": "e50dfaf753372846"}
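
For triage, it may help to pull the structured fields out of the log line above. The sketch below parses a shortened copy of that line (field values kept verbatim) and checks whether the call was rejected in under a millisecond, which would suggest the rejection happens at an admission check before any sandbox work begins. The helper name `parse_fields` and the 1 ms threshold are illustrative assumptions. The error text refers to sandboxes *starting*, so the limit being hit may differ from the 500 per-node cap; that is a guess and not confirmed from the orchestrator code.

```python
import json

# Shortened copy of the log line from this report (field values kept verbatim).
LOG_LINE = (
    'SandboxService/Create/unary [ResourceExhausted]: finished call '
    '{"grpc.method": "Create", "grpc.code": "ResourceExhausted", '
    '"grpc.error": "rpc error: code = ResourceExhausted desc = too many '
    'sandboxes starting on this node, please retry", "grpc.time_ms": "0.536", '
    '"peer.address": "10.254.72.8:50292"}'
)


def parse_fields(line: str) -> dict:
    """Return the JSON object that follows the free-text prefix of a log line."""
    start = line.index("{")
    return json.loads(line[start:])


fields = parse_fields(LOG_LINE)

# Assumed heuristic: a sub-millisecond ResourceExhausted is an admission-time rejection.
rejected_fast = (
    fields["grpc.code"] == "ResourceExhausted"
    and float(fields["grpc.time_ms"]) < 1.0
)
print(fields["grpc.error"], rejected_fast)
```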

Expected behavior
Each node enforces its limit of 500 only against its own active + starting sandboxes.
The cluster can scale to nearly 4500 total sandboxes before throttling begins.
Leaked orphan FC processes are cleaned up automatically so the quota they occupy is released.
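
To make the expected behavior concrete, here is a minimal model of per-node admission, assuming each node counts only its own running + starting sandboxes against the 500 limit. The `Node` class, `try_admit`, `mark_started`, and `fill_cluster` are hypothetical names for illustration and are not taken from the orchestrator.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one node; not the orchestrator's real type."""
    max_sandboxes: int = 500  # per-node limit from this report
    running: int = 0
    starting: int = 0

    def try_admit(self) -> bool:
        """Admit a Create only if running + starting stays within the node limit."""
        if self.running + self.starting >= self.max_sandboxes:
            return False
        self.starting += 1
        return True

    def mark_started(self) -> None:
        """Move one sandbox from the starting state to the running state."""
        self.starting -= 1
        self.running += 1


def fill_cluster(nodes: list) -> int:
    """Place sandboxes round-robin until every node rejects; return the total admitted."""
    total = 0
    progress = True
    while progress:
        progress = False
        for node in nodes:
            if node.try_admit():
                node.mark_started()
                total += 1
                progress = True
    return total


nodes = [Node() for _ in range(9)]
total = fill_cluster(nodes)
print(total)
```

Under this model, throttling begins only once every node holds 500 sandboxes, which is the behavior requested above.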
Reproduce steps
Deploy 9 nodes with a per-node max of 500.
Gradually create sandboxes until the total reaches ~1200.
All new Create requests then fail immediately with the ResourceExhausted error.
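
The steps above can be scripted as a small driver that counts successful creations until the first rejection. This is only a sketch: `ResourceExhausted` is a stand-in exception, `create` is whatever callable issues the real Create request, and the fake backend exists solely so the example runs on its own (its threshold of 1200 mirrors the approximate figure in this report and is not a measured value).

```python
from typing import Callable


class ResourceExhausted(Exception):
    """Stand-in for the gRPC ResourceExhausted status (hypothetical type)."""


def count_until_rejected(create: Callable[[], None], limit: int) -> int:
    """Call create() up to `limit` times; return how many succeeded before the first rejection."""
    created = 0
    for _ in range(limit):
        try:
            create()
        except ResourceExhausted:
            break
        created += 1
    return created


def make_fake_backend(threshold: int) -> Callable[[], None]:
    """Fake backend for demonstration only: rejects once `threshold` creations succeeded."""
    state = {"n": 0}

    def create() -> None:
        if state["n"] >= threshold:
            raise ResourceExhausted(
                "too many sandboxes starting on this node, please retry"
            )
        state["n"] += 1

    return create


observed = count_until_rejected(make_fake_backend(1200), limit=4500)
print(observed)
```

Replacing the fake backend with a real client call would report the cluster-wide count at which creation first fails, which can then be compared against the per-node counts.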
/cc @jakubno
