Skip to content

Clarify EndEvent.error semantics for normal non-zero exits in envd's process handler #2657

@zyl1121

Description

@zyl1121

Summary

I would like to clarify the intended semantics of ProcessEvent.EndEvent.error in envd.

Currently, when a process starts successfully but exits with a non-zero status, envd sets both exit_code and error. For example, a normal exit 1 is reported roughly as:

exit_code: 1
exited: true
status: "exit status 1"
error: "exit status 1"

This makes it unclear whether error represents a runtime/infrastructure-level failure, or simply any non-zero process result.

This is not tied to a specific E2B Cloud Sandbox ID or Build ID. The behavior was observed in a downstream integration using envd's Process API, and it also appears to follow from the current process handler code path.

Current behavior

In packages/envd/internal/services/process/handler/handler.go, Wait() calls cmd.Wait() and copies any returned error into EndEvent.error:

err := p.cmd.Wait()

var errMsg *string
if err != nil {
    msg := err.Error()
    errMsg = &msg
}

endEvent := &rpc.ProcessEvent_EndEvent{
    Error:    errMsg,
    ExitCode: int32(p.cmd.ProcessState.ExitCode()),
    Exited:   p.cmd.ProcessState.Exited(),
    Status:   p.cmd.ProcessState.String(),
}

However, in Go, exec.Cmd.Wait() returns *exec.ExitError when the process starts successfully but exits with a non-zero status.

So a normal command failure such as exit 1 becomes both:

exit_code = 1
error = "exit status 1"

This appears to be the relevant envd code path that constructs the process end event for Process.Start.

Question: should normal non-zero exits set EndEvent.error?

The current proto definition is:

message EndEvent {
    sint32 exit_code = 1;
    bool exited = 2;
    string status = 3;
    optional string error = 4;
}

The part I would like to clarify is the relationship between exit_code and error.

When a process starts successfully and exits with a non-zero code, should that be represented only by exit_code / exited / status, or should it also populate error?

My expectation is that exit_code should describe the result of the user's process, while error should describe envd/runtime-level failures, such as failing to start, wait for, or observe the process.

If this is the intended contract, it would be helpful to document it in process.proto so clients can handle process failures consistently.

Why this matters

For Process API clients, the current behavior makes it difficult to distinguish between:

  1. the process ran and returned a non-zero exit code; and
  2. envd/runtime failed to start, run, wait for, or observe the process.

In a downstream Kubernetes operator integration, this caused failed lifecycle hook scripts to be treated as runtime execution errors. As a result, the client entered an error-handling path and lost the collected stdout/stderr from the failed script.

The user-visible error became only:

hook execution error: process error: exit status 1

instead of showing the actual script output.

Root cause

The underlying issue seems to be the mapping from Go's os/exec API to the envd Process API.

exec.Cmd.Wait() uses an error return for normal non-zero process exits via *exec.ExitError. That is useful at the Go API level, but it may not map cleanly to the RPC-level EndEvent.error field if that field is intended to mean a runtime/infrastructure-level failure.

The current envd implementation passes the cmd.Wait() error through to EndEvent.error without distinguishing *exec.ExitError from other wait/runtime errors.

Possible direction

If the intended semantics are to keep error for runtime/infrastructure-level failures, envd could special-case *exec.ExitError from cmd.Wait() when ProcessState is valid and the process has exited:

err := p.cmd.Wait()

var errMsg *string
if err != nil{
    var exitErr *exec.ExitError
    if !(errors.As(err, &exitErr) && p.cmd.ProcessState != nil && p.cmd.ProcessState.Exited()) {
        msg := err.Error()
        errMsg = &msg
    }
}

Normal non-zero exits would then be represented by exit_code, exited, and status, while error would remain available for runtime/infrastructure-level failures.

It may also be useful to add comments to process.proto documenting the intended meaning of error.

Compatibility note

This could be a behavior change for existing Process API clients that currently rely on EndEvent.error being set for non-zero exits.

So I would like to confirm the intended semantics before submitting a PR. I am happy to contribute a PR if maintainers agree with this direction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions