diff --git a/changes/11322.doc.md b/changes/11322.doc.md new file mode 100644 index 00000000000..8c3f2ced091 --- /dev/null +++ b/changes/11322.doc.md @@ -0,0 +1 @@ +Add BEP-1053 (agent-level batch entrypoint retry) and BEP-1054 (session rescheduling on terminal failure) covering the two-tier batch resilience design — in-script retry stays on the agent; node-level failures reschedule the same session through the existing scheduler lifecycle handlers diff --git a/proposals/BEP-1053-agent-batch-retry.md b/proposals/BEP-1053-agent-batch-retry.md new file mode 100644 index 00000000000..91074f8620f --- /dev/null +++ b/proposals/BEP-1053-agent-batch-retry.md @@ -0,0 +1,131 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Agent-level Batch Retry + +## Related Issues + +- JIRA: BA-5851 +- GitHub Epic: #11320 +- GitHub: #11321 +- Companion BEP: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md) + +## Motivation + +When a `BATCH` session's entrypoint exits non-zero, the session is marked failed and the user must manually re-submit. Most batch failures in practice are transient (a flaky network call, a downstream service hiccup, an intermittent dependency error) and a simple in-place re-run would have succeeded. Today the user pays the cost of re-creating the session — re-scheduling, re-pulling the image, re-mounting volumes — for a problem that is purely inside the script. + +This BEP adds a small **agent-side** knob: re-run the batch entrypoint inside the same kernel up to N times before reporting failure. It is the simpler, smaller half of the batch-retry feature; the companion BEP-1054 covers the case where the failure is at the *node* level and a fresh schedule is needed. + +### Goals + +- Opt-in retry of the batch entrypoint inside an existing kernel. +- No new manager-side state, tables, or events. 
+- Default `batch_retries = 0` keeps current behavior. +- Per-session knob; no policy framework needed at this layer. + +### Non-goals + +- Failures before the kernel is running (image pull, scheduling). Those go to BEP-1054. +- OOM and node-level failures. Re-running on the same node typically does not help; BEP-1054 handles them by rescheduling. +- A user-supplied retry-policy DSL with backoff and classification. Out of scope for v1; if needed, accrue evidence first and design separately. + +## Current Design + +The agent runs batch entrypoints in `Agent.execute_batch()` (`src/ai/backend/agent/agent.py:2406`). The path: + +1. Kernel reaches the running state. +2. If `kernel_obj.session_type == SessionTypes.BATCH` (`agent.py:2274`), the agent enqueues `execute_batch(session_id, kernel_id, startup_command, batch_timeout)` into `_ongoing_exec_batch_tasks` (line 840). +3. `execute_batch` invokes the kernel runner via `kernel.execute(...)` once. +4. On a non-zero exit code (or timeout), the agent emits `SessionFailureAnycastEvent` and `SessionFailureBroadcastEvent` (lines 2375, 2389, 2464, 2478, 2492). +5. On success, it emits `SessionSuccessAnycastEvent`/`SessionSuccessBroadcastEvent`. + +There is no in-script retry — the entrypoint runs exactly once per session. `RestartTracker` (line 757) handles *kernel* restart on agent crash recovery, not script re-execution. + +## Proposed Design + +### Knob + +Two new fields on the batch session creation request, plumbed through the existing kernel-config path that already carries `startup_command` and `batch_timeout`: + +| Field | Type | Default | Meaning | +|---|---|---|---| +| `batch_retries` | int (≥ 0) | `0` | Maximum number of additional `execute_batch` attempts after the first. Total attempts = `1 + batch_retries`. | +| `batch_retry_delay` | float seconds (≥ 0) | `0.0` | Wait between attempts. Constant; no backoff at this layer. 
| + +The two fields sit alongside `startup_command`, `bootstrap_script`, and `batch_timeout` in the session creation DTO. They are batch-only — the agent ignores them when `session_type != SessionTypes.BATCH`. + +### Execution loop + +`execute_batch` becomes: + +```python +async def execute_batch(self, session_id, kernel_id, startup_command, batch_timeout, + batch_retries: int = 0, batch_retry_delay: float = 0.0): + last_exit_code: int | None = None + for attempt in range(batch_retries + 1): + if attempt > 0: + log.info("execute_batch(k:{}) retry attempt {}/{}", kernel_id, attempt, batch_retries) + await asyncio.sleep(batch_retry_delay) + last_exit_code = await self._run_batch_once(session_id, kernel_id, startup_command, batch_timeout) + if last_exit_code == 0: + await self._emit_session_success(session_id, kernel_id) + return + # else: non-zero exit -> retry if attempts remain + # exhausted + await self._emit_session_failure(session_id, kernel_id, last_exit_code) +``` + +Only **non-zero exit codes** trigger a retry. Cancellation, timeout, and infrastructure errors (kernel disconnect, container crash) do **not** loop here: +- Cancellation propagates as today. +- Timeout (`KernelLifecycleEventReason.TASK_TIMEOUT`, `agent.py:2492`) emits failure as today; rerunning a script that already ran past `batch_timeout` is unhelpful. +- Container-level failures escalate to BEP-1054's domain. + +### Observability + +- `bai_agent_batch_retry_attempted_total{session_id_type=batch}` counter (per attempt beyond the first). +- `bai_agent_batch_retry_succeeded_total` counter (incremented when a retry attempt exits zero). +- `bai_agent_batch_retry_exhausted_total` counter (incremented when the loop ends with non-zero). +- Each retry attempt logged at INFO with `(kernel_id, attempt, max_attempts)`. +- The existing failure event is emitted only on final exhaustion; no new event types. + +### What does **not** change + +- Session lifecycle, statuses, or transitions. 
+- Manager-side handlers (`SessionEventHandler`, sokovan). +- Database schema. +- `creation_id`, `parent_session_id` (does not exist), retry chain (does not exist). +- API surface beyond the two new fields on the create request. + +The only manager-side change is plumbing `batch_retries` and `batch_retry_delay` from the create request into the kernel config payload that the agent already receives. + +## Migration / Compatibility + +- Default `batch_retries = 0` preserves current behavior for every existing caller. +- New fields are additive on the create request and on responses (echoed back for visibility). +- No Alembic migration required. +- Operators have a per-session opt-out by leaving the field unset; no global kill switch needed because the feature is opt-in. + +## Implementation Plan + +Three PRs: + +1. **BEP draft** (this document) plus the companion BEP-1054 — #11321. +2. **Agent change:** extend `execute_batch` with the retry loop, plumb `batch_retries`/`batch_retry_delay` from kernel config, add metrics, unit tests around the loop semantics. +3. **Client surface:** SDK v2 + CLI v2 accept the two new fields on `./bai session create -t batch`. REST v2 / GraphQL v2 echo them on session info responses. + +Tests live with the code under test. The agent's batch executor has existing test scaffolding; the loop is the smallest possible delta. + +Estimated effort: under one week for one engineer, given the constrained scope. 
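The loop semantics that the agent-change PR would unit-test (total attempts = `1 + batch_retries`, stop on the first zero exit, report the last exit code on exhaustion) can be sketched in isolation. This is a minimal illustration, not the agent's code: `run_with_retries` and `make_fake_runner` are hypothetical names, and the real `execute_batch` would also emit session events and handle timeout and cancellation.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_retries(
    run_once: Callable[[], Awaitable[int]],
    batch_retries: int = 0,
    batch_retry_delay: float = 0.0,
) -> tuple[int, int]:
    """Run `run_once` up to 1 + batch_retries times.

    Returns (final_exit_code, attempts_made). Hypothetical helper that
    mirrors only the retry decision of the proposed loop.
    """
    if batch_retries < 0 or batch_retry_delay < 0:
        raise ValueError("batch_retries and batch_retry_delay must be >= 0")
    exit_code = 1  # overwritten by the first attempt; the loop runs at least once
    attempts = 0
    for attempt in range(batch_retries + 1):
        if attempt > 0:
            # Constant delay between attempts; no backoff at this layer.
            await asyncio.sleep(batch_retry_delay)
        exit_code = await run_once()
        attempts += 1
        if exit_code == 0:
            break  # success: stop retrying
    return exit_code, attempts


def make_fake_runner(exit_codes: list[int]) -> Callable[[], Awaitable[int]]:
    """Build a fake entrypoint that yields the given exit codes in order."""
    remaining = iter(exit_codes)

    async def run_once() -> int:
        return next(remaining)

    return run_once
```

A test could feed the fake runner a sequence such as `[1, 1, 0]` with `batch_retries=2` and check that the third attempt's success ends the loop.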
+ +## References + +- Companion: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md) +- Working draft of the prior single-BEP design and the pivot rationale: `docs/investigation/bep-1053-design-pivot.md` +- Apache Airflow's `retries` parameter (the inspirational reference): `airflow-core/src/airflow/models/taskinstance.py:1109-1159` +- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md) diff --git a/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md new file mode 100644 index 00000000000..44a07e5d60b --- /dev/null +++ b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md @@ -0,0 +1,169 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Session Rescheduling on Terminal Failure + +## Related Issues + +- JIRA: BA-5851 +- GitHub Epic: #11320 +- GitHub: #11321 +- Companion BEP: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md) + +## Motivation + +Some session failures are **node-level**: the kernel was OOM-killed on this host, the agent disconnected mid-run, the registry route used by this scaling group is briefly down, the network namespace setup failed for a node-specific reason. For these cases, re-running the script in place — Backend.AI's existing scheduler-internal retries, or BEP-1053's agent-level batch retry — does not help. What does help is **rescheduling the same session to a different node**, with the same resource allocation. + +Today, terminal-failure sessions stay terminal. There is no path that takes a session in `ERROR` and pushes it back through the scheduler. Operators have to ask users to re-create their sessions, often after diagnosing that the failure was the host's fault, not the user's. This BEP closes that gap. 
+ +It is the companion to [BEP-1053](BEP-1053-agent-batch-retry.md), which handles in-script retry; together they cover the two distinct retry surfaces. They are designed to ship independently. + +### Goals + +- Re-dispatch a terminal-failed `BATCH` session through the scheduler when the failure is classified as **node-level**. +- Reuse existing scheduler infrastructure: `SessionLifecycleHandler`, `phase_attempts`, scheduling history, the `expired → PENDING` transition pattern. +- Make failure classification **operator-extensible** — etcd-driven pattern config, not a closed enum in code. +- Promote the standing `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`) to a real configuration knob as a side effect. +- Default off; opt-in per scaling group. + +### Non-goals + +- Mutating resource allocation (no "give it more memory and retry"). Resource decisions stay with the user/admin. +- User-facing per-session `RetryPolicy` with backoff/jitter/max. Rescheduling is operator-policy, not user-policy. +- Interactive or inference sessions. INTERACTIVE is user-driven; INFERENCE has BEP-1049 deployment-route handling. +- Re-running the user script in place. That is BEP-1053's job. + +## Current Design + +### Session lifecycle and terminal status + +`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle. `terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out today. `retriable_statuses()` (line 118) is the scheduler's *in-session* retriable set; it does not apply to sessions already in `ERROR`. + +### Sokovan lifecycle handlers + +Periodic `SessionLifecycleHandler`s drive scheduler decisions (`sokovan/scheduler/handlers/`). Each declares `success / need_retry / expired / give_up` outcomes and the status transitions for each (`base.py:62-93`). 
Existing handlers include `CheckPreconditionLifecycleHandler` and `StartSessionsLifecycleHandler`, which use the **`expired → PENDING`** transition pattern (`check_precondition.py:67`, `start_sessions.py:78`) — the canonical "re-schedule this session" mechanism, scoped today to startup-stage timeouts. + +### Existing counters and caps + +- `phase_attempts` (`sokovan/data/lifecycle.py:322`): per-session attempt counter sourced from scheduling history (`coordinator.py:756`). Documented as "give_up when >= max_retries." +- `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`): the global cap, used by both session and deployment coordinators (`coordinator.py:1228`, `deployment/coordinator.py:764`). + +### Failure metadata + +When a session fails, `SessionRow.status_data` carries `{"error": {"name": ..., "src": ...}}` per `manager/exceptions.py:convert_to_status_data` and the `ErrorStatusInfo` / `ErrorDetail` TypedDicts (line 97). The shape is stable. + +### What is missing + +A handler that fires on **terminal-failure** sessions, classifies the failure, and either reschedules or accepts the failure. Today's handlers run on non-terminal sessions only. + +## Proposed Design + +### A new lifecycle handler: `RescheduleFailedBatchSessionsLifecycleHandler` + +Lives at `sokovan/scheduler/handlers/lifecycle/reschedule_failed_batch.py`, alongside the existing handlers. Targets sessions where: + +- `session_type == SessionTypes.BATCH` +- `status == ERROR` +- `phase_attempts < effective_max_retries` +- `status_data["error"]` classifies as a *reschedulable* cause (see "Failure classification" below). + +Outcomes: + +- **`success`** (rescheduling fired): transition `ERROR → PENDING`. Re-uses the existing `expired → PENDING` machinery, just from a new starting status. Increments `phase_attempts` via the standard scheduling-history append. +- **`give_up`** (cap reached, or cause not reschedulable): no transition. Session stays in `ERROR`. 
+- **`need_retry`** (transient inability to act, e.g., DB contention): no transition; handler retries next cycle. + +The handler reuses **everything** the existing lifecycle handlers reuse: `phase_attempts` from scheduling history is the counter, `SERVICE_MAX_RETRIES` (now configurable, see below) is the cap, the lifecycle-coordinator path applies the transition. No new column on `SessionRow`. No queue table. No child sessions. + +### Same session, not a child + +A reschedule keeps the original `SessionRow` — same `id`, same `creation_id`, same kernels record, same resource allocation. The session re-enters `PENDING` with `phase_attempts` incremented; the scheduler picks a new agent on the next dispatch cycle. The kernels associated with the previous attempt are cleaned up as part of the terminal-state transition that already runs today. + +This is intentionally different from the original BEP-1053 draft: there are no parent-child rows, no retry chain, no `parent_session_id`. The "history" of attempts is what scheduling history already records. + +### Failure classification — extensible, not closed + +A closed enum of causes hardcodes runtime behavior into code; site-specific failure signatures (vendor accelerator faults, registry-specific image-pull errors, custom-plugin failures) cannot be classified without a manager release. Replace the closed enum with a **pattern-based config**, loaded from etcd and refreshed via `EtcdConfigWatcher` (`manager/config/provider.py:20`): + +```yaml +# config/manager/session_failure_classification +default: give_up +by_error_name: + OOMError: reschedule + AgentDisconnected: reschedule + ImagePullError: give_up # agent's tenacity already retried + HeartbeatTimeout: reschedule + ValidationError: give_up + QuotaExceededError: give_up +by_error_src: + agent: reschedule # fallback for agent-side errors not named above +``` + +Resolution order: `by_error_name` (most specific) → `by_error_src` → `default`. 
The result is one of three closed `Action` values: `reschedule`, `give_up`, or `ignore` (do not handle yet — leave for the next cycle, used rarely). + +The **action catalog** stays a closed enum (the manager has to know what each action means), but the **cause catalog** is open: operators add patterns without code changes. + +Hardcoded never-reschedulable causes: `USER_CANCELLED` (user intent), and any cause that originates *after* the session reached `RUNNING` and the user's script started — those are BEP-1053's domain. The handler short-circuits on these regardless of config. + +### `SERVICE_MAX_RETRIES` becomes configurable + +Same etcd path: `config/manager/scheduler_max_retries`. Read at startup, refreshed via `EtcdConfigWatcher`. Per-scaling-group overrides under `config/scaling-groups/{sg_name}/scheduler_max_retries`. Default `5` (matches current constant). The handler resolves the cap from scaling-group config first, then cluster, then default. Closes the standing `FIXME: make configurable`. + +### Kill switch + +`config/manager/reschedule_disabled` (etcd boolean, default `false`). Loaded at startup, watched. Checked at the top of the handler's per-cycle execution. When `true`, the handler is a no-op for that cycle. Useful for incident response (e.g., stop rescheduling cluster-wide during a cascade). + +### Observability + +- Counters: `bai_session_reschedule_attempted_total{cause}`, `bai_session_reschedule_capped_total{cause}` (cap reached), `bai_session_reschedule_succeeded_total` (subsequent attempt reached `RUNNING`). +- Event: `session.rescheduled` emitted when `ERROR → PENDING` transition fires. Reuses the existing event-publication path from the lifecycle coordinator. +- Audit log entry per reschedule: `(session_id, cause, attempt N of M, source_agent, target_after = scheduler_choice)`. +- The existing scheduling-history rows already record per-attempt timestamps and outcomes; that is the durable trail. 
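The resolution order from the failure-classification section (`by_error_name`, then `by_error_src`, then `default`) can be sketched as follows. This is an illustration under stated assumptions, not the proposed `FailureClassifier`: the function name `classify_failure` and the plain-dict config shape are hypothetical, and the etcd loading, `EtcdConfigWatcher` refresh, and hardcoded never-reschedulable short-circuits are left out.

```python
from enum import Enum
from typing import Any, Mapping


class Action(Enum):
    """Closed action catalog named in the proposal."""

    RESCHEDULE = "reschedule"
    GIVE_UP = "give_up"
    IGNORE = "ignore"


def classify_failure(
    config: Mapping[str, Any], status_data: Mapping[str, Any]
) -> Action:
    """Resolve an Action: by_error_name, then by_error_src, then default.

    `config` is assumed to be the parsed classification document and
    `status_data` the session's {"error": {"name": ..., "src": ...}} payload.
    """
    error = status_data.get("error") or {}
    by_name = config.get("by_error_name") or {}
    by_src = config.get("by_error_src") or {}

    name = error.get("name")
    src = error.get("src")
    if name in by_name:  # most specific match wins
        return Action(by_name[name])
    if src in by_src:  # fallback on the error source
        return Action(by_src[src])
    # A missing default is treated as give_up in this sketch.
    return Action(config.get("default", "give_up"))


# Example config mirroring part of the YAML in the classification section.
EXAMPLE_CONFIG = {
    "default": "give_up",
    "by_error_name": {"OOMError": "reschedule", "ImagePullError": "give_up"},
    "by_error_src": {"agent": "reschedule"},
}
```

With `EXAMPLE_CONFIG`, an `ImagePullError` raised on the agent resolves to `give_up` because the name match takes precedence over the `agent` source fallback.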
+ +## Migration / Compatibility + +### Backward compatibility + +- With the default `reschedule_disabled = false` *and* the default classification config, no cause maps to a `reschedule` action. So **the feature is effectively off until an operator populates the classification config** — zero behavior change on rollout. +- All etcd keys are additive; no existing key changes shape. +- No Alembic migration required. +- `SERVICE_MAX_RETRIES` constant in `manager/defs.py:121` remains as the default if the etcd key is absent. The `FIXME` is closed; the constant becomes a fallback. + +### Quota and accounting + +A reschedule does not create a new `SessionRow`, so concurrent-session limits are unaffected. Resource consumption from the previous attempt is not refunded — the user *did* consume those resources on the failed node — but the next attempt re-uses the same allocation request, so quota is not double-counted. + +### Interaction with BEP-1053 + +The two BEPs are designed to compose: + +- **BEP-1053** runs first inside the failing kernel; non-zero exit → re-run script; only if all attempts fail does the agent emit `SessionFailureAnycastEvent`. +- **BEP-1054** then evaluates the resulting terminal-failure session. If the cause is node-level, the scheduler reschedules. If the cause is "user script failed after all in-place retries," the classification config maps it to `give_up` and the session stays terminal. + +A session can therefore experience: agent-side script retries → manager-side reschedule → on a new node, agent-side script retries again. Each attempt's history is recorded in scheduling history; users see one logical job, operators see the full trail. + +## Implementation Plan + +Six PRs, each tracked under #11320: + +1. **BEP draft** (this document and the companion BEP-1053) — #11321. +2. **Foundation:** `FailureClassifier` (pattern-based, etcd-driven, refreshed via `EtcdConfigWatcher`) and the `Action` enum. Pure logic, unit-test heavy. +3. 
**`SERVICE_MAX_RETRIES` configurability:** etcd source + per-scaling-group override + fallback to the `defs.py` constant. Closes the standing FIXME. +4. **Lifecycle handler:** `RescheduleFailedBatchSessionsLifecycleHandler`, kill switch, the `ERROR → PENDING` transition (extending the existing pattern to a new starting status), counters/events/audit. +5. **API surface:** session info responses include `reschedule_count` (= `phase_attempts` view) and the latest `reschedule_cause`. No mutation; this is read-only observability. +6. **Client:** SDK v2 + CLI v2 surface the new info fields; user docs. + +Tests live with the code under test. Cross-cutting integration tests — node-level failure → reschedule → success on different agent; cap-reached → terminal; classification-config-empty → terminal; kill-switch-on → no rescheduling — ship with the lifecycle-handler PR. Estimated effort: two to three weeks for one engineer. + +## References + +- Companion: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md) +- Working draft and design pivot rationale: `docs/investigation/bep-1053-design-pivot.md` +- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md) +- [BEP-1049: Zero-Downtime Deployment Strategy Architecture](BEP-1049-deployment-strategy-handler.md) — analogous handler-pattern for routes diff --git a/proposals/README.md b/proposals/README.md index b0024efe64e..590085e2753 100644 --- a/proposals/README.md +++ b/proposals/README.md @@ -123,6 +123,8 @@ BEP numbers start from 1000. 
| [1050](BEP-1050-prometheus-query-preset-system.md) | Prometheus Query Preset System | BoKeum Kim | Draft | | [1051](BEP-1051-kata-containers-agent.md) | Kata Containers Agent Backend | Kyujin Cho | Draft | | [1052](BEP-1052-scoped-app-config-redesign.md) | Scoped App Config Redesign | Gyubong Lee | Draft | +| [1053](BEP-1053-agent-batch-retry.md) | Agent-level Batch Retry | Jeongseok Kang | Draft | +| [1054](BEP-1054-session-rescheduling-on-terminal-failure.md) | Session Rescheduling on Terminal Failure | Jeongseok Kang | Draft | | _next_ | _(reserve your number here)_ | | | ## File Structure