1 change: 1 addition & 0 deletions changes/11322.doc.md
@@ -0,0 +1 @@
Add BEP-1053 (agent-level batch entrypoint retry) and BEP-1054 (session rescheduling on terminal failure) covering the two-tier batch resilience design — in-script retry stays on the agent; node-level failures reschedule the same session through the existing scheduler lifecycle handlers
131 changes: 131 additions & 0 deletions proposals/BEP-1053-agent-batch-retry.md
@@ -0,0 +1,131 @@
---
Author: Jeongseok Kang (jskang@lablup.com)
Status: Draft
Created: 2026-04-27
Created-Version: 26.5.0
Target-Version:
Implemented-Version:
---

# Agent-level Batch Retry

## Related Issues

- JIRA: BA-5851
- GitHub Epic: #11320
- GitHub: #11321
- Companion BEP: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md)

## Motivation

When a `BATCH` session's entrypoint exits non-zero, the session is marked failed and the user must manually re-submit. Most batch failures in practice are transient (a flaky network call, a downstream service hiccup, an intermittent dependency error) and a simple in-place re-run would have succeeded. Today the user pays the cost of re-creating the session — re-scheduling, re-pulling the image, re-mounting volumes — for a problem that is purely inside the script.

This BEP adds a small **agent-side** knob: re-run the batch entrypoint inside the same kernel up to N times before reporting failure. It is the simpler, smaller half of the batch-retry feature; the companion BEP-1054 covers the case where the failure is at the *node* level and a fresh schedule is needed.

### Goals

- Opt-in retry of the batch entrypoint inside an existing kernel.
- No new manager-side state, tables, or events.
- Default `batch_retries = 0` keeps current behavior.
- Per-session knob; no policy framework needed at this layer.

### Non-goals

- Failures before the kernel is running (image pull, scheduling). Those go to BEP-1054.
- OOM and node-level failures. Re-running on the same node typically does not help; BEP-1054 handles them by rescheduling.
- A user-supplied retry-policy DSL with backoff and classification. Out of scope for v1; if needed, accrue evidence first and design separately.

## Current Design

The agent runs batch entrypoints in `Agent.execute_batch()` (`src/ai/backend/agent/agent.py:2406`). The path:

1. Kernel reaches the running state.
2. If `kernel_obj.session_type == SessionTypes.BATCH` (`agent.py:2274`), the agent enqueues `execute_batch(session_id, kernel_id, startup_command, batch_timeout)` into `_ongoing_exec_batch_tasks` (line 840).
3. `execute_batch` invokes the kernel runner via `kernel.execute(...)` once.
4. On a non-zero exit code (or timeout), the agent emits `SessionFailureAnycastEvent` and `SessionFailureBroadcastEvent` (lines 2375, 2389, 2464, 2478, 2492).
5. On success, it emits `SessionSuccessAnycastEvent`/`SessionSuccessBroadcastEvent`.

There is no in-script retry — the entrypoint runs exactly once per session. `RestartTracker` (line 757) handles *kernel* restart on agent crash recovery, not script re-execution.

## Proposed Design

### Knob

Two new fields on the batch session creation request, plumbed through the existing kernel-config path that already carries `startup_command` and `batch_timeout`:

| Field | Type | Default | Meaning |
|---|---|---|---|
| `batch_retries` | int (≥ 0) | `0` | Maximum number of additional `execute_batch` attempts after the first. Total attempts = `1 + batch_retries`. |
| `batch_retry_delay` | float seconds (≥ 0) | `0.0` | Wait between attempts. Constant; no backoff at this layer. |

The two fields sit alongside `startup_command`, `bootstrap_script`, and `batch_timeout` in the session creation DTO. They are batch-only — the agent ignores them when `session_type != SessionTypes.BATCH`.
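
For illustration only, a batch create request carrying the new knobs might look like the following sketch; the image reference and values are made up, and only the last two fields are new:

```python
# Hypothetical batch create payload; field names other than the two new
# knobs mirror the existing session-creation DTO.
create_request = {
    "session_type": "batch",
    "image": "cr.backend.ai/stable/python:3.11",  # made-up image reference
    "startup_command": "python train.py",
    "batch_timeout": 3600,        # existing field, seconds
    "batch_retries": 3,           # new: up to 3 re-runs after the first attempt
    "batch_retry_delay": 10.0,    # new: constant 10-second wait between attempts
}
```

The agent reads the two fields from the kernel config payload exactly as it reads `batch_timeout` today.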

### Execution loop

`execute_batch` becomes:

```python
async def execute_batch(
    self,
    session_id,
    kernel_id,
    startup_command,
    batch_timeout,
    batch_retries: int = 0,
    batch_retry_delay: float = 0.0,
) -> None:
    last_exit_code: int | None = None
    for attempt in range(batch_retries + 1):
        if attempt > 0:
            log.info("execute_batch(k:{}) retry attempt {}/{}", kernel_id, attempt, batch_retries)
            await asyncio.sleep(batch_retry_delay)
        last_exit_code = await self._run_batch_once(
            session_id, kernel_id, startup_command, batch_timeout,
        )
        if last_exit_code == 0:
            await self._emit_session_success(session_id, kernel_id)
            return
        # else: non-zero exit -> retry if attempts remain
    # exhausted: every attempt ended with a non-zero exit code
    await self._emit_session_failure(session_id, kernel_id, last_exit_code)
```

Only **non-zero exit codes** trigger a retry. Cancellation, timeout, and infrastructure errors (kernel disconnect, container crash) do **not** loop here:
- Cancellation propagates as today.
- Timeout (`KernelLifecycleEventReason.TASK_TIMEOUT`, `agent.py:2492`) emits failure as today; rerunning a script that already ran past `batch_timeout` is unhelpful.
- Container-level failures escalate to BEP-1054's domain.
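
A minimal sketch of the `_run_batch_once` helper assumed by the loop above makes that boundary explicit; the kernel-call shape is illustrative, not the actual `kernel.execute(...)` signature:

```python
async def _run_batch_once(self, session_id, kernel_id, startup_command, batch_timeout):
    # Sketch: only a run that actually completed reports an exit code back
    # to the retry loop; every other outcome propagates as an exception.
    try:
        result = await self._get_kernel(kernel_id).execute(  # illustrative call shape
            startup_command, timeout=batch_timeout,
        )
    except asyncio.CancelledError:
        raise  # session cancelled: propagate unchanged, never retried
    except asyncio.TimeoutError:
        raise  # batch_timeout hit: existing TASK_TIMEOUT failure path, never retried
    return result.exit_code
```

Infrastructure errors (kernel disconnect, container crash) likewise propagate out of the loop and surface through the existing failure events, leaving them to BEP-1054.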

### Observability

- `bai_agent_batch_retry_attempted_total{session_type=batch}` counter (incremented per attempt beyond the first).
- `bai_agent_batch_retry_succeeded_total` counter (incremented when a retry attempt exits zero).
- `bai_agent_batch_retry_exhausted_total` counter (incremented when the loop ends with non-zero).
- Each retry attempt logged at INFO with `(kernel_id, attempt, max_attempts)`.
- The existing failure event is emitted only on final exhaustion; no new event types.
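
As a sketch of the three counters, assuming plain `prometheus_client` rather than whatever wrapper the agent's metric registry actually provides:

```python
from prometheus_client import Counter

# Names follow this proposal; labels omitted for brevity.
BATCH_RETRY_ATTEMPTED = Counter(
    "bai_agent_batch_retry_attempted_total",
    "Batch entrypoint retry attempts beyond the first run",
)
BATCH_RETRY_SUCCEEDED = Counter(
    "bai_agent_batch_retry_succeeded_total",
    "Retry attempts that exited zero",
)
BATCH_RETRY_EXHAUSTED = Counter(
    "bai_agent_batch_retry_exhausted_total",
    "Retry loops that ended with a non-zero exit code",
)
```

The loop increments `BATCH_RETRY_ATTEMPTED` at the top of each retry iteration and exactly one of the other two when it exits.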

### What does **not** change

- Session lifecycle, statuses, or transitions.
- Manager-side handlers (`SessionEventHandler`, sokovan).
- Database schema.
- `creation_id`; no `parent_session_id` or retry-chain concept is introduced (neither exists today).
- API surface beyond the two new fields on the create request.

The only manager-side change is plumbing `batch_retries` and `batch_retry_delay` from the create request into the kernel config payload that the agent already receives.

## Migration / Compatibility

- Default `batch_retries = 0` preserves current behavior for every existing caller.
- New fields are additive on the create request and on responses (echoed back for visibility).
- No Alembic migration required.
- Operators have a per-session opt-out by leaving the field unset; no global kill switch needed because the feature is opt-in.

## Implementation Plan

Three PRs:

1. **BEP draft** (this document) plus the companion BEP-1054 — #11321.
2. **Agent change:** extend `execute_batch` with the retry loop, plumb `batch_retries`/`batch_retry_delay` from kernel config, add metrics, unit tests around the loop semantics.
3. **Client surface:** SDK v2 + CLI v2 accept the two new fields on `./bai session create -t batch`. REST v2 / GraphQL v2 echo them on session info responses.

Tests live with the code under test. The agent's batch executor has existing test scaffolding; the loop is the smallest possible delta.

Estimated effort: under one week for one engineer, given the constrained scope.

## References

- Companion: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md)
- Working draft of the prior single-BEP design and the pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
- Apache Airflow's `retries` parameter (the design this borrows from): `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
169 changes: 169 additions & 0 deletions proposals/BEP-1054-session-rescheduling-on-terminal-failure.md
@@ -0,0 +1,169 @@
---
Author: Jeongseok Kang (jskang@lablup.com)
Status: Draft
Created: 2026-04-27
Created-Version: 26.5.0
Target-Version:
Implemented-Version:
---

# Session Rescheduling on Terminal Failure

## Related Issues

- JIRA: BA-5851
- GitHub Epic: #11320
- GitHub: #11321
- Companion BEP: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)

## Motivation

Some session failures are **node-level**: the kernel was OOM-killed on this host, the agent disconnected mid-run, the registry route used by this scaling group is briefly down, the network namespace setup failed for a node-specific reason. For these cases, re-running the script in place — Backend.AI's existing scheduler-internal retries, or BEP-1053's agent-level batch retry — does not help. What does help is **rescheduling the same session to a different node**, with the same resource allocation.

Today, terminal-failure sessions stay terminal. There is no path that takes a session in `ERROR` and pushes it back through the scheduler. Operators have to ask users to re-create their sessions, often after diagnosing that the failure was the host's fault, not the user's. This BEP closes that gap.

It is the companion to [BEP-1053](BEP-1053-agent-batch-retry.md), which handles in-script retry; together they cover the two distinct retry surfaces. They are designed to ship independently.

### Goals

- Re-dispatch a terminal-failed `BATCH` session through the scheduler when the failure is classified as **node-level**.
- Reuse existing scheduler infrastructure: `SessionLifecycleHandler`, `phase_attempts`, scheduling history, the `expired → PENDING` transition pattern.
- Make failure classification **operator-extensible** — etcd-driven pattern config, not a closed enum in code.
- Promote the standing `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`) to a real configuration knob as a side effect.
- Default off; opt-in per scaling group.

### Non-goals

- Mutating resource allocation (no "give it more memory and retry"). Resource decisions stay with the user/admin.
- User-facing per-session `RetryPolicy` with backoff/jitter/max. Rescheduling is operator-policy, not user-policy.
- Interactive or inference sessions. INTERACTIVE is user-driven; INFERENCE has BEP-1049 deployment-route handling.
- Re-running the user script in place. That is BEP-1053's job.

## Current Design

### Session lifecycle and terminal status

`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle. `terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out today. `retriable_statuses()` (line 118) is the scheduler's *in-session* retriable set; it does not apply to sessions already in `ERROR`.

### Sokovan lifecycle handlers

Periodic `SessionLifecycleHandler`s drive scheduler decisions (`sokovan/scheduler/handlers/`). Each declares `success / need_retry / expired / give_up` outcomes and the status transitions for each (`base.py:62-93`). Existing handlers include `CheckPreconditionLifecycleHandler` and `StartSessionsLifecycleHandler`, which use the **`expired → PENDING`** transition pattern (`check_precondition.py:67`, `start_sessions.py:78`) — the canonical "re-schedule this session" mechanism, scoped today to startup-stage timeouts.

### Existing counters and caps

- `phase_attempts` (`sokovan/data/lifecycle.py:322`): per-session attempt counter sourced from scheduling history (`coordinator.py:756`). Documented as "give_up when >= max_retries."
- `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`): the global cap, used by both session and deployment coordinators (`coordinator.py:1228`, `deployment/coordinator.py:764`).

### Failure metadata

When a session fails, `SessionRow.status_data` carries `{"error": {"name": ..., "src": ...}}` per `manager/exceptions.py:convert_to_status_data` and the `ErrorStatusInfo` / `ErrorDetail` TypedDicts (line 97). The shape is stable.
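
For instance, an OOM-killed kernel might arrive at the classifier (below) shaped like this; the field values are illustrative, the shape is the stable part:

```python
# Illustrative status_data payload; only the {"error": {"name", "src"}} shape is guaranteed.
status_data = {
    "error": {
        "name": "OOMError",  # exception class name, per convert_to_status_data
        "src": "agent",      # which component reported the failure
    },
}
```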

### What is missing

A handler that fires on **terminal-failure** sessions, classifies the failure, and either reschedules or accepts the failure. Today's handlers run on non-terminal sessions only.

## Proposed Design

### A new lifecycle handler: `RescheduleFailedBatchSessionsLifecycleHandler`

Lives at `sokovan/scheduler/handlers/lifecycle/reschedule_failed_batch.py`, alongside the existing handlers. Targets sessions where:

- `session_type == SessionTypes.BATCH`
- `status == ERROR`
- `phase_attempts < effective_max_retries`
- `status_data["error"]` classifies as a *reschedulable* cause (see "Classification" below).

Outcomes:

- **`success`** (rescheduling fired): transition `ERROR → PENDING`. Re-uses the existing `expired → PENDING` machinery, just from a new starting status. Increments `phase_attempts` via the standard scheduling-history append.
- **`give_up`** (cap reached, or cause not reschedulable): no transition. Session stays in `ERROR`.
- **`need_retry`** (transient inability to act, e.g., DB contention): no transition; handler retries next cycle.

The handler reuses **everything** the existing lifecycle handlers reuse: `phase_attempts` from scheduling history is the counter, `SERVICE_MAX_RETRIES` (now configurable, see below) is the cap, the lifecycle-coordinator path applies the transition. No new column on `SessionRow`. No queue table. No child sessions.
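
As a rough sketch of the per-session decision, with the caveat that the hook name and outcome types below are placeholders; the actual interface is whatever the handler base at `base.py:62-93` defines:

```python
# Sketch only: LifecycleOutcome / Action and the handle_one() hook are
# placeholders for the real SessionLifecycleHandler interface.
class RescheduleFailedBatchSessionsLifecycleHandler(SessionLifecycleHandler):
    # Selection (conceptually): session_type == BATCH and status == ERROR.

    async def handle_one(self, session) -> LifecycleOutcome:
        if await self._config.reschedule_disabled():   # kill switch (see below)
            return LifecycleOutcome.NEED_RETRY         # no-op; revisit next cycle
        action = self._classifier.classify(session.status_data.get("error") or {})
        if action is not Action.RESCHEDULE:
            return LifecycleOutcome.GIVE_UP            # session stays in ERROR
        if session.phase_attempts >= await self._effective_max_retries(session):
            return LifecycleOutcome.GIVE_UP            # cap reached
        return LifecycleOutcome.SUCCESS                # coordinator applies ERROR -> PENDING
```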

### Same session, not a child

A reschedule keeps the original `SessionRow` — same `id`, same `creation_id`, same kernels record, same resource allocation. The session re-enters `PENDING` with `phase_attempts` incremented; the scheduler picks a new agent on the next dispatch cycle. The kernels associated with the previous attempt are cleaned up as part of the terminal-state transition that already runs today.

This is intentionally different from the original BEP-1053 draft: there are no parent-child rows, no retry chain, no `parent_session_id`. The "history" of attempts is what scheduling history already records.

### Failure classification — extensible, not closed

A closed enum of causes hardcodes runtime behavior into code; site-specific failure signatures (vendor accelerator faults, registry-specific image-pull errors, custom-plugin failures) cannot be classified without a manager release. Replace the closed enum with a **pattern-based config**, loaded from etcd and refreshed via `EtcdConfigWatcher` (`manager/config/provider.py:20`):

```yaml
# config/manager/session_failure_classification
default: give_up
by_error_name:
  OOMError: reschedule
  AgentDisconnected: reschedule
  ImagePullError: give_up        # agent's tenacity already retried
  HeartbeatTimeout: reschedule
  ValidationError: give_up
  QuotaExceededError: give_up
by_error_src:
  agent: reschedule              # fallback for agent-side errors not named above
```

Resolution order: `by_error_name` (most specific) → `by_error_src` → `default`. The result is one of three closed `Action` values: `reschedule`, `give_up`, or `ignore` (do not handle yet — leave for the next cycle, used rarely).

The **action catalog** stays a closed enum (the manager has to know what each action means), but the **cause catalog** is open: operators add patterns without code changes.

Hardcoded never-reschedulable causes: `USER_CANCELLED` (user intent), and any cause that originates *after* the session reached `RUNNING` and the user's script started — those are BEP-1053's domain. The handler short-circuits on these regardless of config.
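
A sketch of the classifier PR 2 would introduce, operating on the `status_data["error"]` shape shown earlier and a dict deserialized from the YAML above (the post-`RUNNING` short-circuit needs session phase information that is omitted here):

```python
import enum

class Action(enum.Enum):
    RESCHEDULE = "reschedule"
    GIVE_UP = "give_up"
    IGNORE = "ignore"

# Causes the handler refuses to reschedule regardless of operator config.
_NEVER_RESCHEDULE = {"USER_CANCELLED"}

class FailureClassifier:
    def __init__(self, config: dict) -> None:
        self._config = config  # replaced wholesale on EtcdConfigWatcher refresh

    def classify(self, error: dict) -> Action:
        name = error.get("name", "")
        if name in _NEVER_RESCHEDULE:
            return Action.GIVE_UP  # hardcoded short-circuit; config cannot override
        by_name = self._config.get("by_error_name", {})
        if name in by_name:
            return Action(by_name[name])
        by_src = self._config.get("by_error_src", {})
        src = error.get("src", "")
        if src in by_src:
            return Action(by_src[src])
        return Action(self._config.get("default", "give_up"))
```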

### `SERVICE_MAX_RETRIES` becomes configurable

The same etcd config mechanism, at a new key: `config/manager/scheduler_max_retries`. Read at startup, refreshed via `EtcdConfigWatcher`. Per-scaling-group overrides live under `config/scaling-groups/{sg_name}/scheduler_max_retries`. Default `5` (matching the current constant). The handler resolves the cap from scaling-group config first, then cluster config, then the default. This closes the standing `FIXME: make configurable`.
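
A sketch of that resolution, assuming an etcd accessor that returns the raw string or `None`; the key paths are the ones named above:

```python
async def effective_max_retries(self, sg_name: str) -> int:
    # Most specific first: scaling group, then cluster, then the defs.py constant.
    for key in (
        f"config/scaling-groups/{sg_name}/scheduler_max_retries",
        "config/manager/scheduler_max_retries",
    ):
        raw = await self._etcd.get(key)  # assumed accessor returning str | None
        if raw is not None:
            return int(raw)
    return SERVICE_MAX_RETRIES  # fallback: the existing constant, default 5
```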

### Kill switch

`config/manager/reschedule_disabled` (etcd boolean, default `false`). Loaded at startup, watched. Checked at the top of the handler's per-cycle execution. When `true`, the handler is a no-op for that cycle. Useful for incident response (e.g., stop rescheduling cluster-wide during a cascade).

### Observability

- Counters: `bai_session_reschedule_attempted_total{cause}`, `bai_session_reschedule_capped_total{cause}` (cap reached), `bai_session_reschedule_succeeded_total` (subsequent attempt reached `RUNNING`).
- Event: `session.rescheduled` emitted when `ERROR → PENDING` transition fires. Reuses the existing event-publication path from the lifecycle coordinator.
- Audit log entry per reschedule: `(session_id, cause, attempt N of M, source_agent, target_after = scheduler_choice)`.
- The existing scheduling-history rows already record per-attempt timestamps and outcomes; that is the durable trail.

## Migration / Compatibility

### Backward compatibility

- With the default `reschedule_disabled = false` *and* the default (empty) classification config, no cause maps to a `reschedule` action. So **the feature is effectively off until an operator populates the classification config**: zero behavior change on rollout.
- All etcd keys are additive; no existing key changes shape.
- No Alembic migration required.
- `SERVICE_MAX_RETRIES` constant in `manager/defs.py:121` remains as the default if the etcd key is absent. The `FIXME` is closed; the constant becomes a fallback.

### Quota and accounting

A reschedule does not create a new `SessionRow`, so concurrent-session limits are unaffected. Resource consumption from the previous attempt is not refunded — the user *did* consume those resources on the failed node — but the next attempt re-uses the same allocation request, so quota is not double-counted.

### Interaction with BEP-1053

The two BEPs are designed to compose:

- **BEP-1053** runs first inside the failing kernel; non-zero exit → re-run script; only if all attempts fail does the agent emit `SessionFailureAnycastEvent`.
- **BEP-1054** then evaluates the resulting terminal-failure session. If the cause is node-level, the scheduler reschedules. If the cause is "user script failed after all in-place retries," the classification config maps it to `give_up` and the session stays terminal.

A session can therefore experience: agent-side script retries → manager-side reschedule → on a new node, agent-side script retries again. Each attempt's history is recorded in scheduling history; users see one logical job, operators see the full trail.

## Implementation Plan

Six PRs, each tracked under #11320:

1. **BEP draft** (this document and the companion BEP-1053) — #11321.
2. **Foundation:** `FailureClassifier` (pattern-based, etcd-driven, refreshed via `EtcdConfigWatcher`) and the `Action` enum. Pure logic, unit-test heavy.
3. **`SERVICE_MAX_RETRIES` configurability:** etcd source + per-scaling-group override + fallback to the `defs.py` constant. Closes the standing FIXME.
4. **Lifecycle handler:** `RescheduleFailedBatchSessionsLifecycleHandler`, kill switch, the `ERROR → PENDING` transition (extending the existing pattern to a new starting status), counters/events/audit.
5. **API surface:** session info responses include `reschedule_count` (= `phase_attempts` view) and the latest `reschedule_cause`. No mutation; this is read-only observability.
6. **Client:** SDK v2 + CLI v2 surface the new info fields; user docs.

Tests live with the code under test. Cross-cutting integration tests — node-level failure → reschedule → success on different agent; cap-reached → terminal; classification-config-empty → terminal; kill-switch-on → no rescheduling — ship with the lifecycle-handler PR. Estimated effort: two to three weeks for one engineer.

## References

- Companion: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)
- Working draft and design pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
- [BEP-1049: Zero-Downtime Deployment Strategy Architecture](BEP-1049-deployment-strategy-handler.md) — analogous handler-pattern for routes