Code Interpreter REPL Process Fails to Start (Sandbox Timeout) - on AWS ECS EC2 based Tasks/Docker Workloads #114

@Zirkonium88

Description

The Code Interpreter API fails to start REPL processes inside nsjail sandboxes. The REPL process never becomes ready, timing out after 60 seconds and causing all code execution requests to fail.

Steps to Reproduce

  1. Deploy the Code Interpreter service on ECS EC2 (with nsjail/sandbox support)
  2. Send a code execution request to the Code Interpreter agent via LibreChat
  3. The API creates a sandbox but the REPL process never initializes
  4. After the 60-second timeout, the execution request fails

Expected Behavior

The REPL process starts successfully inside the nsjail sandbox and code execution completes normally.

Actual Behavior

The REPL process fails to start. The sandbox times out after 60 seconds and the error field in the log is empty, which suggests the process fails silently during initialization.

Environment

  • Service: code-interpreter-api v1.2.0
  • Infrastructure: AWS ECS on EC2 (m5.large), privileged: true, SYS_ADMIN capability
  • Container: Custom patched image based on ghcr.io/usnavy13/librecodeinterpreter

Logs/Screenshots

{
  "sandbox_id": "e41c9dd4abc9",
  "timeout": 60,
  "event": "REPL ready timeout",
  "logger": "src.services.sandbox.repl_executor",
  "level": "warning",
  "timestamp": "2026-05-27T04:51:30.275965Z",
  "service": "code-interpreter-api",
  "version": "1.2.0"
}
{
  "sandbox_id": "e41c9dd4abc9",
  "error": "",
  "event": "Failed to start REPL process",
  "logger": "src.services.sandbox.pool",
  "level": "error",
  "timestamp": "2026-05-27T04:51:30.276189Z",
  "service": "code-interpreter-api",
  "version": "1.2.0"
}
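
To spot this failure pattern across a larger log dump, something like the following sketch may help. It assumes the service writes back-to-back JSON objects as shown above, and that a "Failed to start REPL process" event with an empty error field is the signature of interest. The function names are hypothetical and not part of code-interpreter-api.

```python
import json


def iter_json_objects(text):
    """Yield each JSON object from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode() rejects leading whitespace, so skip it manually.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def silent_repl_failures(text):
    """Return sorted sandbox IDs whose REPL start failed with an empty error."""
    return sorted({
        entry["sandbox_id"]
        for entry in iter_json_objects(text)
        if entry.get("event") == "Failed to start REPL process"
        and not entry.get("error")
        and "sandbox_id" in entry
    })
```

Passing the two log entries above to silent_repl_failures() would report the sandbox e41c9dd4abc9, since its failure event carries an empty error string.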

Root Cause & Resolution

The nsjail sandbox requires /app/ssl and /app/data directories to exist at runtime. Without them, the REPL process fails silently during sandbox initialization.

Fix: Added mkdir -p /app/ssl /app/data to the container command before starting the application:

command=[
    "/bin/sh",
    "-c",
    "mkdir -p /app/ssl /app/data && "
    "mount -t proc proc /var/lib/code-interpreter/empty_proc && "
    "exec python3 -m src.main",
],

The task definition also requires:

  • privileged: True — needed for the proc mount and nsjail
  • SYS_ADMIN capability — required by nsjail for namespace operations
  • init_process_enabled: True — proper zombie process reaping inside sandboxes
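
For anyone not using the same deployment tooling, the three settings above should correspond to the following fields in a raw ECS container definition. This is a sketch under that assumption, not the exact task definition used here. The container name and image passed in are placeholders, and build_container_definition is a hypothetical helper.

```python
import json


def build_container_definition(name, image):
    """Build an ECS container definition dict with the settings listed above."""
    # Same startup command as in the fix: create the directories, mount proc,
    # then hand over to the application.
    startup = (
        "mkdir -p /app/ssl /app/data && "
        "mount -t proc proc /var/lib/code-interpreter/empty_proc && "
        "exec python3 -m src.main"
    )
    return {
        "name": name,    # placeholder, not taken from the issue
        "image": image,  # placeholder, the patched image tag is not given
        "privileged": True,
        "linuxParameters": {
            "capabilities": {"add": ["SYS_ADMIN"]},
            "initProcessEnabled": True,
        },
        "command": ["/bin/sh", "-c", startup],
    }


if __name__ == "__main__":
    definition = build_container_definition("code-interpreter", "example/image:tag")
    print(json.dumps(definition, indent=2))
```

The printed JSON can be compared against the containerDefinitions entry of a registered task definition to check that all three settings actually made it into the deployed revision.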

Additional Context

The empty error field in the logs made this hard to diagnose: nsjail doesn't surface missing-directory errors to the application layer. The fix ensures the required directory structure exists before the sandbox pool initializes.
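
Since the failure is silent, a preflight check that reports missing directories might make this easier to catch in future. The sketch below is an alternative to the shell mkdir, assuming the check would run in Python before the sandbox pool starts. ensure_dirs is a hypothetical helper, and where it would be called in src.main is not something this issue establishes.

```python
import os

# Directories the sandbox was found to need at runtime (from the root cause above).
REQUIRED_DIRS = ("/app/ssl", "/app/data")


def ensure_dirs(paths=REQUIRED_DIRS):
    """Create any missing directories and return the ones that were created."""
    created = []
    for path in paths:
        if not os.path.isdir(path):
            # Raises FileExistsError if the path exists but is not a directory,
            # which is a loud failure instead of a silent sandbox timeout.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Logging the returned list at startup would leave a trace in the service logs whenever the image is missing one of these directories.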
