Skip to content

Scheduler marks executor dead on a deterministic task launch/decode failure, hanging the job #1908

Description

@andygrove

Describe the bug

When an executor fails to launch/decode a task (a non-transient, deterministic error such as a plan deserialization failure), the scheduler treats it the same as losing the executor: it removes the executor, rolls back its stages, and — with a single executor — then reports "There are no alive executors to bind tasks" and the job hangs forever instead of failing.

A task that cannot be decoded will fail the same way on every executor, so this is a query-level error, not an executor-health problem. Marking the executor dead both hides the real error and (with few executors) wedges the whole job.

Observed sequence

Submitting TPC-H q15 (an uncorrelated scalar subquery) on DataFusion 54 with 1 scheduler + 1 executor:

ERROR ballista_scheduler::state: Failed to launch new task: Internal Ballista error:
  Failed to connect to executor <id>: Status { code: InvalidArgument, message:
  "DataFusion error: Internal error: ScalarSubqueryExpr can only be deserialized as part
   of a surrounding ScalarSubqueryExec. ..." }
INFO  ballista_scheduler::state::executor_manager: Removing executor <id>: Some("Failed to launch new task: ...")
WARN  ballista_scheduler::state::execution_graph: Reset N tasks ... on lost Executor <id>
WARN  ballista_scheduler::state::executor_manager: There are no alive executors to bind tasks
DEBUG ballista_scheduler::state: No schedulable tasks found to be launched

The client then blocks indefinitely (both client and scheduler idle at ~0% CPU). The underlying InvalidArgument error is never surfaced to the client.

(The specific decode error above is a separate DataFusion-54 issue — filed separately — but it is a good example: any deterministic per-task decode/launch failure produces this hang.)

Expected behavior

A deterministic task launch/decode failure (e.g. gRPC InvalidArgument, or an error that recurs across attempts) should fail the query and propagate the error to the client, rather than marking the executor dead. Executor removal should be reserved for genuine connectivity/liveness failures (transport errors, timeouts, heartbeat loss).

Suggested direction

  • Distinguish "executor unreachable / lost" from "executor rejected this task" at the task-launch site (ballista/scheduler/src/state/ task launching). InvalidArgument (and similar deterministic statuses) should mark the task/stage/job failed, not the executor.
  • Respect the task retry limit for launch failures so a deterministically-undecodable task fails the job after its attempts instead of looping or wedging.

Impact

With a single executor this turns any undecodable task into an indefinite hang with no error reported. In CI this presents as a timeout rather than a clear failure, and one bad query can take down the whole suite.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions