Describe the bug
When an executor fails to launch/decode a task (a non-transient, deterministic error such as a plan deserialization failure), the scheduler treats it the same as losing the executor: it removes the executor, rolls back its stages, and — with a single executor — then reports "There are no alive executors to bind tasks" and the job hangs forever instead of failing.
A task that cannot be decoded will fail the same way on every executor, so this is a query-level error, not an executor-health problem. Marking the executor dead both hides the real error and (with few executors) wedges the whole job.
Observed sequence
Submitting TPC-H q15 (an uncorrelated scalar subquery) on DataFusion 54 with 1 scheduler + 1 executor:
ERROR ballista_scheduler::state: Failed to launch new task: Internal Ballista error:
Failed to connect to executor <id>: Status { code: InvalidArgument, message:
"DataFusion error: Internal error: ScalarSubqueryExpr can only be deserialized as part
of a surrounding ScalarSubqueryExec. ..." }
INFO ballista_scheduler::state::executor_manager: Removing executor <id>: Some("Failed to launch new task: ...")
WARN ballista_scheduler::state::execution_graph: Reset N tasks ... on lost Executor <id>
WARN ballista_scheduler::state::executor_manager: There are no alive executors to bind tasks
DEBUG ballista_scheduler::state: No schedulable tasks found to be launched
The client then blocks indefinitely (both client and scheduler idle at ~0% CPU). The underlying InvalidArgument error is never surfaced to the client.
(The specific decode error above is a separate DataFusion-54 issue — filed separately — but it is a good example: any deterministic per-task decode/launch failure produces this hang.)
Expected behavior
A deterministic task launch/decode failure (e.g. gRPC InvalidArgument, or an error that recurs across attempts) should fail the query and propagate the error to the client, rather than marking the executor dead. Executor removal should be reserved for genuine connectivity/liveness failures (transport errors, timeouts, heartbeat loss).
Suggested direction
- Distinguish "executor unreachable / lost" from "executor rejected this task" at the task-launch site (
ballista/scheduler/src/state/ task launching). InvalidArgument (and similar deterministic statuses) should mark the task/stage/job failed, not the executor.
- Respect the task retry limit for launch failures so a deterministically-undecodable task fails the job after its attempts instead of looping or wedging.
Impact
With a single executor this turns any undecodable task into an indefinite hang with no error reported. In CI this presents as a timeout rather than a clear failure, and one bad query can take down the whole suite.
Describe the bug
When an executor fails to launch/decode a task (a non-transient, deterministic error such as a plan deserialization failure), the scheduler treats it the same as losing the executor: it removes the executor, rolls back its stages, and — with a single executor — then reports "There are no alive executors to bind tasks" and the job hangs forever instead of failing.
A task that cannot be decoded will fail the same way on every executor, so this is a query-level error, not an executor-health problem. Marking the executor dead both hides the real error and (with few executors) wedges the whole job.
Observed sequence
Submitting TPC-H q15 (an uncorrelated scalar subquery) on DataFusion 54 with 1 scheduler + 1 executor:
The client then blocks indefinitely (both client and scheduler idle at ~0% CPU). The underlying
InvalidArgumenterror is never surfaced to the client.(The specific decode error above is a separate DataFusion-54 issue — filed separately — but it is a good example: any deterministic per-task decode/launch failure produces this hang.)
Expected behavior
A deterministic task launch/decode failure (e.g. gRPC
InvalidArgument, or an error that recurs across attempts) should fail the query and propagate the error to the client, rather than marking the executor dead. Executor removal should be reserved for genuine connectivity/liveness failures (transport errors, timeouts, heartbeat loss).Suggested direction
ballista/scheduler/src/state/task launching).InvalidArgument(and similar deterministic statuses) should mark the task/stage/job failed, not the executor.Impact
With a single executor this turns any undecodable task into an indefinite hang with no error reported. In CI this presents as a timeout rather than a clear failure, and one bad query can take down the whole suite.