Symptoms: Sandbox status remains "starting" for more than 60 seconds.
Possible causes:
- Image pull failure — Runtime image not available in the cluster.
- Insufficient resources — No node has enough free CPU/memory to schedule the pod.
- Sidecar not ready — Sidecar container failing readiness probe.
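The pod's status fields usually show which of these three causes applies. The sketch below reads the PodScheduled condition reason, the container waiting reasons, and the sidecar's ready flag, then maps them onto the list above. It assumes the pod name (sbx-<SANDBOX_ID>), the namespace, and the sidecar container name used in the commands below; the functions classify_stuck_sandbox and diagnose_sandbox are hypothetical helpers, not part of xgen.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper for a sandbox stuck in "starting".
# Args: PodScheduled condition reason, container waiting reasons, sidecar ready flag.
classify_stuck_sandbox() {
  local sched_reason="$1" waiting_reasons="$2" sidecar_ready="$3"
  # Image pull problems surface as a waiting reason on the container.
  case "$waiting_reasons" in
    *ErrImagePull*|*ImagePullBackOff*|*InvalidImageName*)
      echo "image-pull-failure"; return 0 ;;
  esac
  # The scheduler sets PodScheduled=False with reason Unschedulable when no node fits.
  if [ "$sched_reason" = "Unschedulable" ]; then
    echo "insufficient-resources"; return 0
  fi
  # Container is running but its readiness probe has not passed.
  if [ "$sidecar_ready" = "false" ]; then
    echo "sidecar-not-ready"; return 0
  fi
  echo "unknown"
}

# Collect the three inputs for one sandbox (requires kubectl access to the cluster).
diagnose_sandbox() {
  local pod="sbx-$1" ns="xgen-sandboxes" sched waiting ready
  sched=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.conditions[?(@.type=="PodScheduled")].reason}')
  waiting=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[*].state.waiting.reason}')
  ready=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="sidecar")].ready}')
  classify_stuck_sandbox "$sched" "$waiting" "$ready"
}
```

Running diagnose_sandbox <SANDBOX_ID> prints one label; "unknown" means the pod events from kubectl describe are the next place to look.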
Resolution:
# Check pod status
kubectl get pods -n xgen-sandboxes -l xgen.io/sandbox-id=<SANDBOX_ID>
# Check pod events
kubectl describe pod sbx-<SANDBOX_ID> -n xgen-sandboxes
# Check sidecar logs
kubectl logs sbx-<SANDBOX_ID> -c sidecar -n xgen-sandboxes
Symptoms: exec calls fail with "no sidecar connection" or time out.
Possible causes:
- Pod IP not reachable — Network policy blocking agent-to-sandbox traffic.
- Sidecar crashed — Check sidecar container status.
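To tell an unreachable pod IP apart from a crashed sidecar, it helps to look up the pod IP first and then probe the sidecar from inside the agent pod. This is a sketch that reuses port 9001, the /readyz path, and the wget call from the connectivity test below; it assumes the agent image's wget accepts -T (BusyBox and GNU wget both do). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Build the sidecar readiness URL from a pod IP; fail if the pod has no IP yet.
sidecar_readyz_url() {
  local ip="$1"
  [ -n "$ip" ] || { echo "pod has no IP (not scheduled yet?)" >&2; return 1; }
  case "$ip" in
    *:*) echo "http://[${ip}]:9001/readyz" ;;  # IPv6 literals need brackets in URLs
    *)   echo "http://${ip}:9001/readyz" ;;
  esac
}

# Probe the sidecar from inside the agent pod (requires kubectl access).
probe_sidecar() {
  local ip url
  ip=$(kubectl get pod "sbx-$1" -n xgen-sandboxes -o jsonpath='{.status.podIP}')
  url=$(sidecar_readyz_url "$ip") || return 1
  # -T 5 bounds the wait so a blocked network path fails fast instead of hanging.
  kubectl exec -n xgen-system deploy/xgen-agent -- wget -qO- -T 5 "$url"
}

# If the probe times out, list policies that may block agent-to-sandbox traffic.
list_sandbox_policies() {
  kubectl get networkpolicy -n xgen-sandboxes
}
```

A timeout from probe_sidecar while the sidecar reports ready points at the network path; a refused connection points at the sidecar itself.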
Resolution:
# Verify sidecar is running
kubectl get pod sbx-<SANDBOX_ID> -n xgen-sandboxes -o jsonpath='{.status.containerStatuses[?(@.name=="sidecar")].ready}'
# Test connectivity from agent
kubectl exec -n xgen-system deploy/xgen-agent -- wget -qO- http://<POD_IP>:9001/readyz
Symptoms: Warm pool shows 0 available pods despite WARM_POOL_SIZE > 0.
Possible causes:
- ResourceQuota exhausted — Check namespace quotas.
- Image not available — Runtime image not loaded in cluster.
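An exhausted quota can be spotted by comparing each ResourceQuota's used and hard values. The sketch below only checks the pods count, which is a plain integer; CPU and memory quantities (for example 500m or 2Gi) would need unit-aware parsing and are deliberately left out. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) when a count-style quota such as "pods" has no headroom left.
# An empty hard limit means the quota does not constrain that resource.
quota_exhausted() {
  local used="$1" hard="$2"
  [ -n "$hard" ] && [ "${used:-0}" -ge "$hard" ]
}

# Report every ResourceQuota in the sandbox namespace whose pod count is maxed out.
check_pod_quota() {
  local ns="xgen-sandboxes" q used hard
  for q in $(kubectl get resourcequota -n "$ns" -o jsonpath='{.items[*].metadata.name}'); do
    used=$(kubectl get resourcequota "$q" -n "$ns" -o jsonpath='{.status.used.pods}')
    hard=$(kubectl get resourcequota "$q" -n "$ns" -o jsonpath='{.status.hard.pods}')
    if quota_exhausted "$used" "$hard"; then
      echo "ResourceQuota $q exhausted: pods $used/$hard"
    fi
  done
}
```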
Resolution:
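# Compare quota usage against hard limits
kubectl get resourcequota -n xgen-sandboxes
# Show the 20 most recent events (e.g. failed pod creation, image pull errors)
kubectl get events -n xgen-sandboxes --sort-by='.lastTimestamp' | tail -20
Symptoms: Agent pod in CrashLoopBackOff.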
Possible causes:
- Missing secrets — API_KEY or JWT_SECRET not set (required since v0.2).
- K8s API unreachable — ServiceAccount permissions issue.
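Both causes can be checked without reading the crash logs. The sketch below lists the keys stored in xgen-agent-secrets and reports which required ones are absent, then asks the API server whether the agent may create sandbox pods. It assumes the secret's key names match the variable names API_KEY and JWT_SECRET, and that the agent runs as a ServiceAccount named xgen-agent; both are assumptions to adjust for your deployment, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Read secret key names (one per line) on stdin; print each required key that is absent.
missing_secret_keys() {
  local present key
  present=$(cat)
  for key in API_KEY JWT_SECRET; do
    printf '%s\n' "$present" | grep -qxF "$key" || echo "$key"
  done
}

# List the keys stored in the agent secret and report missing ones (requires kubectl).
check_agent_secrets() {
  kubectl get secret xgen-agent-secrets -n xgen-system \
    -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' \
    | missing_secret_keys
}

# Check whether the agent's ServiceAccount may manage sandbox pods.
# The ServiceAccount name "xgen-agent" is an assumption.
check_agent_rbac() {
  kubectl auth can-i create pods -n xgen-sandboxes \
    --as=system:serviceaccount:xgen-system:xgen-agent
}
```

Empty output from check_agent_secrets means both keys are present; check_agent_rbac prints "yes" or "no".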
Resolution:
# Inspect logs from the previous (crashed) container
kubectl logs -n xgen-system deploy/xgen-agent --previous
# Confirm the agent secret exists
kubectl get secret xgen-agent-secrets -n xgen-system
Symptoms: All API calls return 401 after login.
Possible causes:
- JWT expired — Default expiry is 15 minutes.
- Clock skew — Agent and client system clocks disagree, so a fresh token can look expired.
Resolution: Log out and log in again with a valid API key. If the clocks disagree, synchronize them (for example, with NTP).
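To confirm which cause applies, the token's exp claim (a standard JWT claim holding Unix seconds) can be compared with the local clock. This is a sketch that assumes the token carries an exp claim and that GNU-compatible base64 -d is available; where the client stores its token is deployment-specific, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Decode the payload (second dot-separated segment) of a JWT.
# JWTs use unpadded base64url, so map the alphabet back and restore "=" padding.
jwt_payload() {
  local seg
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  case $(( ${#seg} % 4 )) in
    2) seg="${seg}==" ;;
    3) seg="${seg}=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Print the token's "exp" claim (Unix seconds), if present.
jwt_exp() {
  jwt_payload "$1" | sed -n 's/.*"exp":[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Print seconds until expiry according to the LOCAL clock; negative means expired.
jwt_seconds_left() {
  local exp
  exp=$(jwt_exp "$1")
  [ -n "$exp" ] || return 1
  echo $(( exp - $(date +%s) ))
}
```

A negative result soon after login suggests the token already expired (the default lifetime is 15 minutes). A positive result while the agent still returns 401 points at clock skew; comparing date -u on the client with kubectl exec -n xgen-system deploy/xgen-agent -- date -u (assuming the image ships date) confirms it.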
Symptoms: Terminal or streaming connections drop unexpectedly.
Possible causes:
- Load balancer timeout — Default idle timeouts (e.g., 60s on ALB).
- Rate limiting — Too many concurrent connections.
Resolution:
- Raise the load balancer idle timeout to at least 300s.
- Check the RATE_LIMIT_PER_MINUTE setting.
- SDKs reconnect automatically (up to 5 attempts with exponential backoff).
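Scripts that talk to the agent without an SDK can approximate the same reconnect policy. The sketch below retries a command up to a fixed number of attempts, doubling the wait each time. The SDKs' actual delays are not documented here, so the base delay is an assumption, and retry_with_backoff is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, retrying with exponential backoff.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
# The wait before retry n is BASE_DELAY * 2^(n-1) seconds (1, 2, 4, 8, ... for base 1).
retry_with_backoff() {
  local max_attempts="$1" base_delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep $(( base_delay * (1 << (attempt - 1)) ))
    attempt=$(( attempt + 1 ))
  done
}
```

For example, retry_with_backoff 5 1 kubectl exec -n xgen-system deploy/xgen-agent -- wget -qO- http://<POD_IP>:9001/readyz retries the sidecar check up to five times. Retrying only masks transient drops; an idle timeout that is too short keeps cutting long-lived connections until it is raised.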