From d1360ffe3a35be66bfdf749cb975677760c2959b Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 10:17:45 +0900 Subject: [PATCH 1/9] docs: add BEP-1053 native session retry proposal Captures the design for adding native session-level retry to Backend.AI core, modeled after Apache Airflow's RetryPolicy parameter surface and adapted to Backend.AI's event-driven architecture. Subsequent implementation work tracked under epic. Refs: #11320, #11321 Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 263 +++++++++++++++++++++ proposals/README.md | 1 + 2 files changed, 264 insertions(+) create mode 100644 proposals/BEP-1053-native-session-retry.md diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md new file mode 100644 index 00000000000..597634863e0 --- /dev/null +++ b/proposals/BEP-1053-native-session-retry.md @@ -0,0 +1,263 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Native Session Retry + +## Related Issues + +- GitHub Epic: #11320 +- GitHub: #11321 + +## Motivation + +Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. + +The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py`), kernel restart on the agent (`agent/agent.py:restarting_kernels`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec." + +This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives: + +- A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller. +- Resilience for plain batch workloads without requiring an external orchestrator. +- Reduced duplication; orchestrators above Backend.AI can thin out their retry layers. + +## Current Design + +Session statuses are defined in `src/ai/backend/manager/data/session/types.py:30-51`: + +``` +PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED +``` + +Terminal statuses with no further transitions: `ERROR`, `TERMINATED`, `CANCELLED`. `SessionStatus.retriable_statuses()` (line 118) classifies which startup states are scheduling-retriable, but there is no notion of *re-creating* a terminal `ERROR` session. + +Session creation flows through `API handler → SessionService.create_from_params() → repository → SessionRow`. `SessionRow.creation_id` already exists as an idempotency key. There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy. + +The termination event handler (`event_dispatcher/handlers/session.py`) listens to `session.terminated` / `session.error` but has no retry decision hook. + +No prior BEP covers session retry or fault tolerance. + +## Proposed Design + +### Mental model + +`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. 
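+
+To make the stance concrete, here is a minimal sketch of the intended decision predicate — illustrative only; `RetryPolicy` and `RetryEligibleCause` are defined in the next section, and the helper below is an assumption of this sketch, not final code:
+
+```python
+# Hardcoded permanent failures live outside the user-facing enum (sketch).
+NEVER_RETRIABLE = frozenset({"user_cancelled", "validation_error", "quota_exceeded"})
+
+def should_retry(cause: str, retry_count: int, policy: "RetryPolicy") -> bool:
+    # Hard exclusions come first: permanent failures ignore user policy entirely.
+    if cause in NEVER_RETRIABLE:
+        return False
+    # The single user-facing knob: are attempts remaining?
+    if retry_count >= policy.max_retries:
+        return False
+    # Eligibility only excludes; the default set covers every ordinary failure.
+    return cause in {c.value for c in policy.eligible_causes}
+```
+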
The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. + +### `RetryPolicy` schema + +A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface: + +```python +class BackoffStrategy(StrEnum): + FIXED = "fixed" + EXPONENTIAL = "exponential" + +class JitterMode(StrEnum): + NONE = "none" + DETERMINISTIC = "deterministic" + RANDOM = "random" + +class RetryEligibleCause(StrEnum): + AGENT_TRANSIENT = "agent_transient" + SCHEDULER_TIMEOUT = "scheduler_timeout" + IMAGE_PULL_FAILURE = "image_pull_failure" + KERNEL_NONZERO_EXIT = "kernel_nonzero_exit" + OOM_KILLED = "oom_killed" + UNKNOWN = "unknown" + + @classmethod + def defaults(cls) -> frozenset["RetryEligibleCause"]: + return frozenset({ + cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT, + cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT, + cls.OOM_KILLED, cls.UNKNOWN, + }) + +class RetryPolicy(BaseModel): + max_retries: NonNegativeInt = 0 + retry_delay: PositiveFloat = 60.0 + backoff: BackoffStrategy = BackoffStrategy.FIXED + backoff_multiplier: PositiveFloat = 2.0 + max_retry_delay: PositiveFloat | None = 3600.0 + jitter: JitterMode = JitterMode.DETERMINISTIC + jitter_ratio: confloat(ge=0, le=1) = 0.25 + eligible_causes: frozenset[RetryEligibleCause] = Field( + default_factory=RetryEligibleCause.defaults + ) + emit_retry_events: bool = True +``` + +Mapping to Airflow: + +| Airflow | `RetryPolicy` | +|---|---| +| `retries` | `max_retries` (count, total attempts = `1 + max_retries`) | +| `retry_delay` | `retry_delay` (seconds) | +| `retry_exponential_backoff` (multiplier) | `backoff: fixed\|exponential` + `backoff_multiplier` | +| `max_retry_delay` (with 24 h hard ceiling) | `max_retry_delay` (24 h hard ceiling preserved) | +| SHA1-deterministic jitter | `jitter` (selectable: none / deterministic / random), `jitter_ratio` | +| Exception-typed eligibility | Structural enum `RetryEligibleCause` | +| `on_retry_callback` | `session.retry_scheduled` / `session.retry_exhausted` events | +| `default_args` precedence | Per-session > project/domain default > etcd cluster default | +| `email_on_retry` | Subsumed by event subscription via webhook plugin | + +Deviations from Airflow and their reasons: + +- **No callback parameter.** Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events. +- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. +- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI conventions and the existing pipeline orchestrator. + +### Failure classification + +A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded non-retriable causes outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. + +| Cause | Default eligible | Notes | +|---|---|---| +| `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | +| `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | +| `IMAGE_PULL_FAILURE` | yes | Typo wastes a few seconds with backoff; registry blip is real. | +| `KERNEL_NONZERO_EXIT` | yes | The most common reason batch users want retry. | +| `OOM_KILLED` | yes | Retry without resource bump usually fails again, but exhausting `max_retries` is cheap. 
| +| `UNKNOWN` | yes | Conservative for unclassified failures. | +| `USER_CANCELLED` | hardcoded never | Permanent. | +| `VALIDATION_ERROR` / `QUOTA_EXCEEDED` | hardcoded never | Permanent. | + +### Backoff formula + +``` +base = retry_delay if backoff == FIXED + min(retry_delay * backoff_multiplier ** retry_count, otherwise + max_retry_delay or MAX_RETRY_DELAY) +delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, + seed=(session_id, retry_count)) +delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) +``` + +`MAX_RETRY_DELAY` is a hard 24 h ceiling. Deterministic jitter takes `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. + +### Defaults precedence + +Three layers, matching Airflow's `default_args` propagation: + +1. Per-session policy in the create request. +2. Project / domain default (new optional field, admin-managed). +3. Cluster default in etcd: `config/manager/retry_policy_default`. Ship default: `max_retries=0` → no behavior change. + +Effective policy = deep-merge top-down; per-session wins. + +### Data model + +One Alembic migration adds to `sessions`: + +``` +parent_session_id : UUID NULL (self-FK) +retry_count : INT NOT NULL DEFAULT 0 +max_retries : INT NOT NULL DEFAULT 0 +retry_policy : JSONB NULL +retry_cause : TEXT NULL +``` + +Rationale: `parent_session_id`, `retry_count`, `max_retries` are first-class columns because they are queried for filters and joins. The rest live in JSONB. **No new history table** — the chain is a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. Cheaper than a separate history table and consistent with Backend.AI's existing model. + +### Decision and dispatch + +A new handler at `event_dispatcher/handlers/session_retry.py` subscribes to `session.terminated` / `session.error`: + +1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. +2. Classify failure. If cause not in `eligible_causes` (or in hardcoded never-retry set) → return. +3. Acquire row lock with `select_for_update()`. If a child with deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists → return (idempotency). +4. Compute `delay` per the formula above. +5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. + +The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. + +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine. + +### API surface + +REST v2 (`api/rest/v2/sessions/`): + +- `POST /sessions` — accept optional `retry_policy` in the request body. +- `GET /sessions/{id}` — return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). +- `GET /sessions/{id}/attempts` — return the chain with status of each attempt. 
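+
+For illustration, a create request carrying the policy might look as follows — a sketch only; the session fields are abbreviated, and the exact request schema is defined by the v2 API, not by this example:
+
+```python
+import json
+
+create_request = {
+    "name": "train-batch-01",   # abbreviated; other session fields omitted
+    "type": "batch",
+    # New in this proposal: optional policy, deep-merged over cluster defaults.
+    "retry_policy": {
+        "max_retries": 3,
+        "retry_delay": 60.0,
+        "backoff": "exponential",
+        "backoff_multiplier": 2.0,
+        "jitter": "deterministic",
+    },
+}
+print(json.dumps(create_request, indent=2))
+```
+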
+ +GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, resolver `retryChain`. + +Client SDK v2 + CLI v2: expose new fields; `./bai session info` shows `attempt N of M` and links to the parent. + +**No retry mutation in v1.** Manual retry is deferred until the auto path is stable. + +### Observability + +- Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. +- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin. Replace the role of Airflow's `on_retry_callback` for downstream consumers. +- Audit log entry per retry dispatch (auto, cause, attempt N of M). + +## Migration / Compatibility + +### Backward compatibility + +- Default `max_retries=0` ⇒ zero behavior change for existing callers. +- All new columns are nullable or default to safe zero values. +- Existing GraphQL and REST clients continue to work; new fields are additive. + +### Migration steps + +1. Apply Alembic migration adding the five columns. Migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. +2. Deploy manager with retry handler and surface, default off via etcd. +3. Operators opt in by setting cluster default or per-session policy. +4. External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. + +### Breaking changes + +None. + +## Implementation Plan + +Six PRs, each tracked by its own sub-issue under #11320: + +1. **BEP draft** (this document) — #11321. +2. **Foundation:** `RetryPolicy` DTO, `classify_failure` module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy. +3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for retry chain. Backportable. +4. **Retry engine:** event handler, `SessionService.create_from_params` extension, defaults precedence (project/domain/etcd), counters/events/audit. +5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. +6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. + +Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR. + +Estimated effort: three to four weeks for one engineer. + +## Decision Log + +| Date | Decision | Rationale | +|------|----------|-----------| +| 2026-04-27 | Batch sessions only in v1 | Interactive sessions are user-driven and do not fit auto-retry semantics. | +| 2026-04-27 | Each retry is a fresh session, linked via `parent_session_id` | Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail. | +| 2026-04-27 | No new `RETRYING` status | Parent goes to `ERROR`, child starts `PENDING` — avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients. | +| 2026-04-27 | Linked-list chain, not a separate history table | The chain is already a list of real `SessionRow`s; no need to duplicate. | +| 2026-04-27 | Structural `RetryEligibleCause` enum, not exception-typed | Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. 
| +| 2026-04-27 | `KERNEL_NONZERO_EXIT` is in the default eligible set | `max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model. | +| 2026-04-27 | `USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable | These are permanent by definition; users cannot opt them into retry. | +| 2026-04-27 | No retry mutation in v1 | Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision. | +| 2026-04-27 | Idempotency via deterministic child `creation_id` | Reuses an existing field; no new uniqueness constraint required. | +| 2026-04-27 | Deterministic jitter seed = `(session_id, retry_count)` | Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry. | + +## Open Questions + +- Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call. +- Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred. +- Manual retry in v2: counts toward `max_retries` or independent? Decide before exposing. +- Default for `max_retry_delay`: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry. +- Project/domain defaults table location: extend an existing table or add a small new `project_retry_defaults` table? + +## References + +- Working draft: `docs/investigation/native-session-retry-plan.md` +- Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159` +- Existing scheduler state-machine BEP: [BEP-1030](BEP-1030-sokovan-scheduler-status-transition.md) +- Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md` diff --git a/proposals/README.md b/proposals/README.md index b0024efe64e..f93b0b63f5b 100644 --- a/proposals/README.md +++ b/proposals/README.md @@ -123,6 +123,7 @@ BEP numbers start from 1000. 
| [1050](BEP-1050-prometheus-query-preset-system.md) | Prometheus Query Preset System | BoKeum Kim | Draft | | [1051](BEP-1051-kata-containers-agent.md) | Kata Containers Agent Backend | Kyujin Cho | Draft | | [1052](BEP-1052-scoped-app-config-redesign.md) | Scoped App Config Redesign | Gyubong Lee | Draft | +| [1053](BEP-1053-native-session-retry.md) | Native Session Retry | Jeongseok Kang | Draft | | _next_ | _(reserve your number here)_ | | | ## File Structure From ed3c61a3a2dfb6f2f9c47caa6a484eda69dee528 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 10:21:01 +0900 Subject: [PATCH 2/9] docs(BA-5851): add news fragment for BEP-1053 Co-Authored-By: Claude Opus 4.7 (1M context) --- changes/11322.doc.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 changes/11322.doc.md diff --git a/changes/11322.doc.md b/changes/11322.doc.md new file mode 100644 index 00000000000..4c246d3f07e --- /dev/null +++ b/changes/11322.doc.md @@ -0,0 +1 @@ +Add BEP-1053 proposing native session-level retry for batch sessions, with a `RetryPolicy` schema modeled after Apache Airflow and adapted to Backend.AI's event-driven model From 1404ebdd161675e4eebede223c2b80f9081dc881 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:10:02 +0900 Subject: [PATCH 3/9] docs(BA-5851): refresh BEP-1053 against current main Verified code paths against latest main: - SessionStatus enum at data/session/types.py:30-50; terminal_statuses() at line 109 - SessionEventHandler at event_dispatcher/handlers/session.py:52, with handlers for started/cancelled/terminating/terminated; status_data["error"] already consulted in handle_session_terminated - SessionService.create_from_params at services/session/service.py:255 - SessionRow.creation_id at models/session/row.py:389-390 Drop redundant tables (Decision Log, Airflow-mapping) and the Open Questions section to match the format used by recent BEPs (BEP-1049, BEP-1050). Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 198 +++++++++------------ 1 file changed, 89 insertions(+), 109 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 597634863e0..bc40ec329cd 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -11,46 +11,72 @@ Implemented-Version: ## Related Issues +- JIRA: BA-5851 - GitHub Epic: #11320 - GitHub: #11321 ## Motivation -Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. +Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py:execute_with_txn_retry`), kernel restart on the agent (`agent.py:RestartTracker`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec." -The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py`), kernel restart on the agent (`agent/agent.py:restarting_kernels`), and `tenacity`-wrapped HTTP/socket retries. 
None of these handle "the session as a whole failed; create a fresh one with the same spec." +The retry concern is therefore pushed to every higher-level orchestrator on top of Backend.AI, each of which re-implements the same logic with inconsistent semantics. Lifting retry into core gives one source of truth, resilience for plain batch workloads, and lets orchestrators thin out their own retry layers. -This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives: +### Goals -- A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller. -- Resilience for plain batch workloads without requiring an external orchestrator. -- Reduced duplication; orchestrators above Backend.AI can thin out their retry layers. +- Opt-in automatic retry for `BATCH` sessions with a `RetryPolicy` accepted at session creation. +- Each retry is a fresh session linked to its parent — no kernel reuse, no new status state. +- Default `max_retries=0` keeps current behavior intact. +- A single user-facing knob: setting `max_retries > 0` retries on any non-permanent failure. ## Current Design -Session statuses are defined in `src/ai/backend/manager/data/session/types.py:30-51`: +### Session lifecycle + +`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle: ``` PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED ``` -Terminal statuses with no further transitions: `ERROR`, `TERMINATED`, `CANCELLED`. `SessionStatus.retriable_statuses()` (line 118) classifies which startup states are scheduling-retriable, but there is no notion of *re-creating* a terminal `ERROR` session. +`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) classifies which startup states the **scheduler** considers retriable for re-dispatch within the same session, but there is no concept of *re-creating* a session that has already gone terminal. + +### Session creation path + +``` +POST /v2/sessions + → CreateFromParamsAction + → SessionService.create_from_params (services/session/service.py:255) + → repository → SessionRow (models/session/row.py:384) +``` + +`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; we can extend it to also key retry attempts. + +There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. -Session creation flows through `API handler → SessionService.create_from_params() → repository → SessionRow`. `SessionRow.creation_id` already exists as an idempotency key. There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy. +### Termination event handling -The termination event handler (`event_dispatcher/handlers/session.py`) listens to `session.terminated` / `session.error` but has no retry decision hook. +`SessionEventHandler` (`event_dispatcher/handlers/session.py:52`) already subscribes to the relevant events: -No prior BEP covers session retry or fault tolerance. 
+| Method | Event | Line | +|---|---|---| +| `handle_session_started` | `SessionStartedAnycastEvent` | 88 | +| `handle_session_cancelled` | `SessionFailureAnycastEvent` | 105 | +| `handle_session_terminating` | `SessionTerminatingAnycastEvent` | 118 | +| `handle_session_terminated` | `SessionTerminatedAnycastEvent` | 130 | + +`handle_session_terminated` already consults `session.status_data["error"]` for endpoint-route bookkeeping, so the failure metadata needed for retry classification is already on hand at this point. What is missing is the decision: "should we spawn a child session?" + +No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status transitions) covers in-session retries by the scheduler, not session re-creation. ## Proposed Design ### Mental model -`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. +`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. Classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. ### `RetryPolicy` schema -A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface: +A Pydantic DTO at `common/dto/manager/v2/session/retry_policy.py` (per the manager `data/` layer rule that Pydantic models live under `dto/`, not `data/`). Schema modeled on Airflow's parameter surface: ```python class BackoffStrategy(StrEnum): @@ -72,11 +98,7 @@ class RetryEligibleCause(StrEnum): @classmethod def defaults(cls) -> frozenset["RetryEligibleCause"]: - return frozenset({ - cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT, - cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT, - cls.OOM_KILLED, cls.UNKNOWN, - }) + return frozenset(cls) class RetryPolicy(BaseModel): max_retries: NonNegativeInt = 0 @@ -92,31 +114,17 @@ class RetryPolicy(BaseModel): emit_retry_events: bool = True ``` -Mapping to Airflow: - -| Airflow | `RetryPolicy` | -|---|---| -| `retries` | `max_retries` (count, total attempts = `1 + max_retries`) | -| `retry_delay` | `retry_delay` (seconds) | -| `retry_exponential_backoff` (multiplier) | `backoff: fixed\|exponential` + `backoff_multiplier` | -| `max_retry_delay` (with 24 h hard ceiling) | `max_retry_delay` (24 h hard ceiling preserved) | -| SHA1-deterministic jitter | `jitter` (selectable: none / deterministic / random), `jitter_ratio` | -| Exception-typed eligibility | Structural enum `RetryEligibleCause` | -| `on_retry_callback` | `session.retry_scheduled` / `session.retry_exhausted` events | -| `default_args` precedence | Per-session > project/domain default > etcd cluster default | -| `email_on_retry` | Subsumed by event subscription via webhook plugin | +Notable deviations from Airflow: -Deviations from Airflow and their reasons: - -- **No callback parameter.** Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events. -- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. 
-- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI conventions and the existing pipeline orchestrator. +- **No callback parameter.** Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events instead of registering an `on_retry_callback`. Keeps the policy serializable and the server's behavior fully auditable. +- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process; classification reads `status_data` instead. +- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI naming and the existing pipeline orchestrator on top of Backend.AI. ### Failure classification -A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded non-retriable causes outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. +A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded never-retriable causes live outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. -| Cause | Default eligible | Notes | +| Cause | In default eligible set | Notes | |---|---|---| | `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | | `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | @@ -138,126 +146,98 @@ delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) ``` -`MAX_RETRY_DELAY` is a hard 24 h ceiling. Deterministic jitter takes `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. +`MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. ### Defaults precedence Three layers, matching Airflow's `default_args` propagation: 1. Per-session policy in the create request. -2. Project / domain default (new optional field, admin-managed). -3. Cluster default in etcd: `config/manager/retry_policy_default`. Ship default: `max_retries=0` → no behavior change. +2. Project / domain default (new optional field on the project config; admin-managed). +3. Cluster default in etcd: `config/manager/retry_policy_default`. -Effective policy = deep-merge top-down; per-session wins. +Effective policy = deep-merge top-down; per-session wins. Ship default at layer 3 is `max_retries=0`. ### Data model One Alembic migration adds to `sessions`: -``` -parent_session_id : UUID NULL (self-FK) -retry_count : INT NOT NULL DEFAULT 0 -max_retries : INT NOT NULL DEFAULT 0 -retry_policy : JSONB NULL -retry_cause : TEXT NULL -``` +| Column | Type | Description | +|---|---|---| +| `parent_session_id` | `UUID NULL` | Self-FK to `sessions.id`; null for the first attempt. | +| `retry_count` | `INT NOT NULL DEFAULT 0` | 0 for the first attempt. | +| `max_retries` | `INT NOT NULL DEFAULT 0` | Denormalized from policy for cheap filters. | +| `retry_policy` | `JSONB NULL` | Full policy. | +| `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. 
| -Rationale: `parent_session_id`, `retry_count`, `max_retries` are first-class columns because they are queried for filters and joins. The rest live in JSONB. **No new history table** — the chain is a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. Cheaper than a separate history table and consistent with Backend.AI's existing model. +`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters and joins; the rest live in JSONB. **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. ### Decision and dispatch -A new handler at `event_dispatcher/handlers/session_retry.py` subscribes to `session.terminated` / `session.error`: +The retry decision is added to the existing termination-event path. Two integration points are equivalent in correctness; the implementation PR will pick one: + +- **Extend `SessionEventHandler`** in `event_dispatcher/handlers/session.py` with a `handle_session_failure` method (or fold the decision into `handle_session_terminated`), since failure metadata is already read there for endpoint-route bookkeeping. +- **Add a sokovan post-processor** under `sokovan/scheduler/post_processors/`, invoked when the scheduler observes a session entering a terminal failure state. + +The decision flow is the same regardless: 1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. -2. Classify failure. If cause not in `eligible_causes` (or in hardcoded never-retry set) → return. -3. Acquire row lock with `select_for_update()`. If a child with deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists → return (idempotency). +2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +3. Acquire a row lock with `select_for_update()`. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). 4. Compute `delay` per the formula above. 5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. -The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. +The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, `resource_slots`, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine. +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" or "this session has a pending child." 
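+
+A sketch of how the computed field could be resolved — illustrative names, not the final resolver:
+
+```python
+def resolve_retry_state(retry_count: int, max_retries: int, has_pending_child: bool) -> dict:
+    # 1-based for humans: retry_count == 0 renders as "attempt 1 of M".
+    return {
+        "attempt": retry_count + 1,
+        "of": 1 + max_retries,  # total attempts = 1 + max_retries
+        "has_pending_child": has_pending_child,
+        "exhausted": retry_count >= max_retries and not has_pending_child,
+    }
+```
+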
This avoids touching the scheduler state machine. ### API surface REST v2 (`api/rest/v2/sessions/`): -- `POST /sessions` — accept optional `retry_policy` in the request body. -- `GET /sessions/{id}` — return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). -- `GET /sessions/{id}/attempts` — return the chain with status of each attempt. +| Method | Path | Purpose | +|---|---|---| +| `POST` | `/sessions` | Accept optional `retry_policy` in `SessionCreateRequest`. | +| `GET` | `/sessions/{id}` | Return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). | +| `GET` | `/sessions/{id}/attempts` | Return the chain with the status of each attempt. | -GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, resolver `retryChain`. +GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, `retryChain` resolver. -Client SDK v2 + CLI v2: expose new fields; `./bai session info` shows `attempt N of M` and links to the parent. +Client SDK v2 + CLI v2: expose the new fields; `./bai session info` shows `attempt N of M` and links to the parent. -**No retry mutation in v1.** Manual retry is deferred until the auto path is stable. +No retry mutation in v1; manual retry is deferred until the auto path stabilizes. ### Observability - Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. -- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin. Replace the role of Airflow's `on_retry_callback` for downstream consumers. -- Audit log entry per retry dispatch (auto, cause, attempt N of M). +- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin, replacing the role of Airflow's `on_retry_callback` for downstream consumers. +- Audit log entry per retry dispatch: cause and attempt N of M. ## Migration / Compatibility -### Backward compatibility - -- Default `max_retries=0` ⇒ zero behavior change for existing callers. -- All new columns are nullable or default to safe zero values. -- Existing GraphQL and REST clients continue to work; new fields are additive. - -### Migration steps - -1. Apply Alembic migration adding the five columns. Migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. -2. Deploy manager with retry handler and surface, default off via etcd. -3. Operators opt in by setting cluster default or per-session policy. -4. External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. - -### Breaking changes - -None. +- Default `max_retries=0` keeps behavior unchanged for every existing caller. +- All new columns are nullable or default to safe zero values; the Alembic migration is purely additive. +- Existing GraphQL and REST clients continue to work; new fields are additive on responses. +- Operators opt in by setting the cluster default in etcd or a per-session policy. +- External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. +- No breaking changes. ## Implementation Plan Six PRs, each tracked by its own sub-issue under #11320: 1. **BEP draft** (this document) — #11321. -2. 
**Foundation:** `RetryPolicy` DTO, `classify_failure` module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy. -3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for retry chain. Backportable. -4. **Retry engine:** event handler, `SessionService.create_from_params` extension, defaults precedence (project/domain/etcd), counters/events/audit. +2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. +3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for the retry chain. +4. **Retry engine:** decision integration in the termination-event path, `SessionService.create_from_params` extension to inherit retry context, defaults precedence (project/domain/etcd), counters/events/audit. 5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. 6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. -Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR. - -Estimated effort: three to four weeks for one engineer. - -## Decision Log - -| Date | Decision | Rationale | -|------|----------|-----------| -| 2026-04-27 | Batch sessions only in v1 | Interactive sessions are user-driven and do not fit auto-retry semantics. | -| 2026-04-27 | Each retry is a fresh session, linked via `parent_session_id` | Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail. | -| 2026-04-27 | No new `RETRYING` status | Parent goes to `ERROR`, child starts `PENDING` — avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients. | -| 2026-04-27 | Linked-list chain, not a separate history table | The chain is already a list of real `SessionRow`s; no need to duplicate. | -| 2026-04-27 | Structural `RetryEligibleCause` enum, not exception-typed | Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. | -| 2026-04-27 | `KERNEL_NONZERO_EXIT` is in the default eligible set | `max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model. | -| 2026-04-27 | `USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable | These are permanent by definition; users cannot opt them into retry. | -| 2026-04-27 | No retry mutation in v1 | Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision. | -| 2026-04-27 | Idempotency via deterministic child `creation_id` | Reuses an existing field; no new uniqueness constraint required. | -| 2026-04-27 | Deterministic jitter seed = `(session_id, retry_count)` | Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry. | - -## Open Questions - -- Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call. -- Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred. -- Manual retry in v2: counts toward `max_retries` or independent? Decide before exposing. 
-- Default for `max_retry_delay`: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry. -- Project/domain defaults table location: extend an existing table or add a small new `project_retry_defaults` table? +Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism — ship with the retry-engine PR. Estimated effort: three to four weeks for one engineer. ## References - Working draft: `docs/investigation/native-session-retry-plan.md` - Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159` -- Existing scheduler state-machine BEP: [BEP-1030](BEP-1030-sokovan-scheduler-status-transition.md) +- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md) - Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md` From 8305ac04a2a0a076316ea5cad9fff60cc04ffa25 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:11:32 +0900 Subject: [PATCH 4/9] docs(BA-5851): use SQLAlchemy with_for_update in BEP-1053 Replace the Django-style select_for_update() reference with SQLAlchemy 2.x syntax matching the existing pattern in repositories/agent/db_source.py and repositories/deployment/db_source.py: sa.select(SessionRow).where(...) .with_for_update() inside begin_session(). Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index bc40ec329cd..52fbc71fb8b 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -183,7 +183,7 @@ The decision flow is the same regardless: 1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Acquire a row lock with `select_for_update()`. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). +3. Lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` inside the session repository's `begin_session()` transaction. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). 4. Compute `delay` per the formula above. 5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. From f0794157d7239fd6bddba6fa6a55ca91756985d3 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:16:47 +0900 Subject: [PATCH 5/9] docs(BA-5851): tighten BEP-1053 idempotency, dispatch, and dependencies Address production-readiness review findings: - Add partial unique index on (parent_session_id, retry_count) as the real idempotency guarantee for retry dispatch. The parent row lock alone is insufficient because creation_id has unique=False (models/session/row.py:390); under concurrent handlers the second INSERT would have succeeded silently. The unique index makes duplicate child creation a hard failure. 
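  In SQLAlchemy terms the index is roughly (a sketch; the index name is
  assumed):
    sa.Index("ix_sessions_parent_retry", "parent_session_id", "retry_count",
             unique=True,
             postgresql_where=sa.text("parent_session_id IS NOT NULL"))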
- Commit to extending SessionEventHandler in event_dispatcher/handlers/ session.py for the retry decision, instead of leaving "two equivalent integration points." Sokovan post-processors run during scheduling iterations, which complicates idempotency without adding capability. Also notes the recent sokovan refactor (#11250 / 8321c79aa) so future readers see why the post-processor path was rejected. - Specifically name BackgroundTaskManager.start_retriable() (already injected into SessionService at service.py:245,408) as the dispatch primitive instead of vague "background task / event mechanism." - Defer project/domain default layer to a follow-up after BEP-1052 (Scoped App Config Redesign) lands, so this BEP doesn't conflict with in-flight config-surface work. - Document the retry handler's own failure modes (classify_failure raise, start_retriable enqueue failure) and the accounting policy (each retry counts against quota; no refund). - Clarify retriable_statuses() is unrelated (in-session re-dispatch, not session re-creation), point DTO location reference to the manager data/CLAUDE.md rule, mark retry_state as an API-layer resolver. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 45 ++++++++++++++-------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 52fbc71fb8b..83dbff823c0 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -38,7 +38,7 @@ The retry concern is therefore pushed to every higher-level orchestrator on top PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED ``` -`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) classifies which startup states the **scheduler** considers retriable for re-dispatch within the same session, but there is no concept of *re-creating* a session that has already gone terminal. +`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) is unrelated to this BEP: it tells the scheduler which **startup** states are still safe to re-dispatch *within the same session*. This BEP introduces a separate concept — re-creating a fresh session after the previous one has gone terminal. ### Session creation path @@ -76,7 +76,7 @@ No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status ### `RetryPolicy` schema -A Pydantic DTO at `common/dto/manager/v2/session/retry_policy.py` (per the manager `data/` layer rule that Pydantic models live under `dto/`, not `data/`). Schema modeled on Airflow's parameter surface: +A Pydantic DTO at `src/ai/backend/common/dto/manager/v2/session/retry_policy.py`, matching the v2 DTO location used by other recent BEPs. Per `src/ai/backend/manager/data/CLAUDE.md`, `data/` is reserved for frozen dataclasses with no framework deps; Pydantic models live under `common/dto/` so they can be shared across REST v2 and GraphQL. Schema modeled on Airflow's parameter surface: ```python class BackoffStrategy(StrEnum): @@ -150,13 +150,14 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) ### Defaults precedence -Three layers, matching Airflow's `default_args` propagation: +Two layers in v1, matching Airflow's `default_args` spirit while staying compatible with parallel work on the config surface: 1. 
Per-session policy in the create request. -2. Project / domain default (new optional field on the project config; admin-managed). -3. Cluster default in etcd: `config/manager/retry_policy_default`. +2. Cluster default in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). -Effective policy = deep-merge top-down; per-session wins. Ship default at layer 3 is `max_retries=0`. +Effective policy = deep-merge top-down; per-session wins. + +**Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. ### Data model @@ -170,26 +171,28 @@ One Alembic migration adds to `sessions`: | `retry_policy` | `JSONB NULL` | Full policy. | | `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | -`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters and joins; the rest live in JSONB. **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. +The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` remains non-unique and is used only for log/trace correlation. -### Decision and dispatch +`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. -The retry decision is added to the existing termination-event path. Two integration points are equivalent in correctness; the implementation PR will pick one: +### Decision and dispatch -- **Extend `SessionEventHandler`** in `event_dispatcher/handlers/session.py` with a `handle_session_failure` method (or fold the decision into `handle_session_terminated`), since failure metadata is already read there for endpoint-route bookkeeping. -- **Add a sokovan post-processor** under `sokovan/scheduler/post_processors/`, invoked when the scheduler observes a session entering a terminal failure state. +The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/session.py:52`), as a new `handle_session_failure` method on the existing class. 
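+
+In outline — a sketch mirroring the decision flow listed below; apart from `SessionEventHandler` itself, every name here is an assumption, and the real signature must match the neighboring handler methods:
+
+```python
+async def handle_session_failure(repo, emit, schedule_child, event) -> None:
+    session = await repo.get_session(event.session_id)  # parent, already terminal
+    policy = session.retry_policy or DEFAULT_POLICY     # merged with etcd default
+    if session.retry_count >= policy.max_retries:
+        await emit("session.retry_exhausted", session.id)
+        return
+    cause = classify_failure(session, session.status_data)
+    if cause in NEVER_RETRIABLE or cause not in policy.eligible_causes:
+        return
+    delay = compute_delay(policy, session.retry_count)  # backoff + jitter, capped
+    await schedule_child(session, delay)                # never sleep in the handler
+```
+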
Rationale: failure metadata (`session.status_data["error"]`) is already loaded there for endpoint-route bookkeeping, the handler runs after the session has reached a terminal status (so the parent state is settled), and adding logic here does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). A sokovan post-processor was considered but rejected for v1: it runs *during* scheduling iterations, which complicates idempotency and timing without adding capability the event-handler path lacks. -The decision flow is the same regardless: +The decision flow: -1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. +1. Load the parent session. If `retry_count >= max_retries`, emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` inside the session repository's `begin_session()` transaction. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). +3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. 4. Compute `delay` per the formula above. -5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. +5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. +6. The child `INSERT` is the second idempotency boundary: the partial unique index on `(parent_session_id, retry_count)` rejects duplicate dispatches that bypass step 3 (e.g., handler crash + replay). + +The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent. -The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, `resource_slots`, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. +**Failure mode of the retry handler itself.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If `BackgroundTaskManager.start_retriable()` fails to enqueue, the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The handler must not raise out of `handle_session_failure`; an unhandled exception in an event handler can stall the dispatcher. -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. 
A computed `retry_state` field on the API tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine. +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely. ### API surface @@ -222,6 +225,14 @@ No retry mutation in v1; manual retry is deferred until the auto path stabilizes - External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. - No breaking changes. +### Quota and accounting + +A retry attempt is a fresh `SessionRow` and counts against the user's concurrent-session limit while it is alive — same as if the user had re-submitted manually. The previous attempt's resource consumption is not refunded; this matches the principle that "actual GPU/CPU time was spent, regardless of why the session ended." The API exposes the chain so accounting tools can group attempts under one logical job if they choose. + +### Operational kill switch + +The cluster-level etcd default doubles as a kill switch: setting `config/manager/retry_policy_default` to `{max_retries: 0}` disables retries globally without redeploying the manager. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). + ## Implementation Plan Six PRs, each tracked by its own sub-issue under #11320: From 461d0e193a6a14bdacba2cd27da33ac564b01093 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:22:40 +0900 Subject: [PATCH 6/9] docs(BA-5851): guard BATCH-only and pin status_data error contract Address second-pass review: - Make the BATCH session-type guard explicit at step 1 of the decision flow. SessionEventHandler is shared across BATCH / INTERACTIVE / INFERENCE; without the guard, handle_session_failure would fire for INFERENCE failures too (the same handler already has INFERENCE-specific routing logic at line 210), violating the stated v1 scope. - Pin the status_data["error"] contract to manager/exceptions.py: convert_to_status_data and the ErrorStatusInfo/ErrorDetail TypedDicts (line 97). classify_failure reads error.name and error.src to map to a RetryEligibleCause; without this pin, the classifier would depend on an undocumented shape. Reviewer also confirmed no duplication with pre-existing features (RestartTracker, scheduler retriable_statuses, BEP-1049 deployment retry) -- all PARALLEL, none cover session-level retry of BATCH. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 83dbff823c0..e4d923b3ff9 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -181,8 +181,8 @@ The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/se The decision flow: -1. Load the parent session. If `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +1. Load the parent session. 
**Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` (see line 210's INFERENCE-specific routing in `handle_session_terminated`) and are explicitly out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. +2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. 3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. 4. Compute `delay` per the formula above. 5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. From 7e65357ea00fd304b9b588fdefe3a10f9f345463 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:28:58 +0900 Subject: [PATCH 7/9] docs(BA-5851): rework dispatch primitive and kill switch in BEP-1053 Three real bugs from third-pass review confirmed against the code: 1) BackgroundTaskManager.start_retriable (common/bgtask/bgtask.py:444) does NOT accept a delay parameter -- it fires immediately via asyncio.create_task. The "retriable" name refers to the task body retrying on failure, not delayed scheduling. The BEP picked this primitive based on its name alone. Replace with a durable session_retry_dispatch_queue table + periodic claim worker (outbox pattern). Survives manager restarts; idempotency matches the sessions-table partial unique index. 2) Adding a sibling handle_session_failure method on SessionEventHandler would have created a second handler on SessionFailureAnycastEvent (already subscribed by handle_batch_result at dispatch.py:520), racing against existing bookkeeping with undefined ordering. Fold the retry decision INTO handle_batch_result instead, in the SessionFailureAnycastEvent arm. 3) The "kill switch via cluster default" was contradictory: per-session policy wins on merge, so any user setting max_retries:N bypassed it. Split into two separate etcd keys: retry_policy_default (a default, merged) and retry_disabled (boolean, checked at the top of the decision flow before merge -- a true kill switch). Also: drop the creation_id retry-suffix idea entirely (creation_id is String(32) and could overflow). The partial unique index on (parent_session_id, retry_count) is the only idempotency boundary now; creation_id stays a per-attempt random token. Implementation Plan grew from 6 to 7 PRs (queue + dispatcher worker is its own PR); estimate bumped from 3-4 to 4-5 weeks. 
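For clarity, a minimal sketch of the intended check order for (3) — names are
illustrative stand-ins, not the final API; RetryPolicy here is a plain
placeholder for the BEP's Pydantic DTO:

    from dataclasses import dataclass

    @dataclass
    class RetryPolicy:                    # stand-in for the BEP's DTO
        max_retries: int = 0
        retry_delay: float = 60.0

    def resolve_retry(per_session: dict, cluster_default: dict,
                      retry_disabled: bool) -> RetryPolicy | None:
        # The kill switch is evaluated before any merge, so a per-session
        # max_retries > 0 can no longer bypass it.
        if retry_disabled:
            return None
        return RetryPolicy(**{**cluster_default, **per_session})  # per-session wins

    assert resolve_retry({"max_retries": 3}, {}, retry_disabled=True) is None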
Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 63 ++++++++++++++-------- 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index e4d923b3ff9..bc57f38dd66 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -49,7 +49,7 @@ POST /v2/sessions → repository → SessionRow (models/session/row.py:384) ``` -`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; we can extend it to also key retry attempts. +`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; today it is generated as `secrets.token_urlsafe(16)` (`services/session/service.py:1593`). It is **not** extended to encode retry chains — those use a separate first-class column (see Data Model). There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. @@ -148,14 +148,12 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) `MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. -### Defaults precedence +### Defaults and kill switch -Two layers in v1, matching Airflow's `default_args` spirit while staying compatible with parallel work on the config surface: +Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": -1. Per-session policy in the create request. -2. Cluster default in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). - -Effective policy = deep-merge top-down; per-session wins. +- **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. +- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Checked at the **top** of the decision flow, before any policy merge. When `true`, no retries are scheduled regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). **Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. @@ -171,26 +169,48 @@ One Alembic migration adds to `sessions`: | `retry_policy` | `JSONB NULL` | Full policy. | | `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | -The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. 
This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` remains non-unique and is used only for log/trace correlation. +The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` is unchanged and remains a per-attempt random token (no retry encoding). + +A second small table `session_retry_dispatch_queue` is added for durable delayed dispatch (see "Decision and dispatch"): + +| Column | Type | Description | +|---|---|---| +| `parent_session_id` | `UUID NOT NULL` | FK to `sessions.id`. | +| `retry_count` | `INT NOT NULL` | Target attempt number (= parent.retry_count + 1). | +| `scheduled_at` | `TIMESTAMPTZ NOT NULL` | Earliest dispatch time. | +| `claimed_at` | `TIMESTAMPTZ NULL` | Set when a dispatcher worker claims the row. | +| `dispatched_at` | `TIMESTAMPTZ NULL` | Set when the child session has been created. | + +Primary key `(parent_session_id, retry_count)` — the same constraint that protects `sessions` also serializes queue inserts. The queue lets retry decisions survive manager restarts, mirrors the durable-outbox pattern used by the existing pipeline orchestrator on top of Backend.AI, and avoids inventing in-memory scheduling. `parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. ### Decision and dispatch -The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/session.py:52`), as a new `handle_session_failure` method on the existing class. Rationale: failure metadata (`session.status_data["error"]`) is already loaded there for endpoint-route bookkeeping, the handler runs after the session has reached a terminal status (so the parent state is settled), and adding logic here does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). A sokovan post-processor was considered but rejected for v1: it runs *during* scheduling iterations, which complicates idempotency and timing without adding capability the event-handler path lacks. +The retry decision is **folded into the existing `SessionEventHandler.handle_batch_result`** (`event_dispatcher/handlers/session.py:152`), not added as a sibling handler. Rationale: + +- `SessionFailureAnycastEvent` is **already** subscribed by `handle_batch_result` (`event_dispatcher/dispatch.py:520`). Adding a second handler on the same event would race against bookkeeping work (`set_session_result`, etc.) and depend on undefined dispatch ordering. 
+- Failure metadata (`session.status_data["error"]`) is already loaded in `handle_batch_result` for the existing failure path; the retry decision can reuse it without new DB roundtrips. +- The handler runs after the session has reached a terminal status, so parent state is settled, and the change does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). + +A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. -The decision flow: +**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (a sokovan-style worker — same cadence as existing periodic tasks) claims rows where `scheduled_at <= now() AND claimed_at IS NULL` via `UPDATE ... SET claimed_at = now() RETURNING ...` (atomic claim under PostgreSQL's row lock) and invokes `SessionService.create_from_params()`. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. -1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` (see line 210's INFERENCE-specific routing in `handle_session_terminated`) and are explicitly out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. +The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): + +1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. +3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. 4. Compute `delay` per the formula above. -5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. -6. 
The child `INSERT` is the second idempotency boundary: the partial unique index on `(parent_session_id, retry_count)` rejects duplicate dispatches that bypass step 3 (e.g., handler crash + replay).
+5. `INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique violation and the `INSERT` is skipped. Emit `session.retry_scheduled`.
+6. The dispatcher worker eventually claims the row, runs `SessionService.create_from_params` with a `CreateFromParamsAction` derived from the parent, and stamps `dispatched_at`. The child `INSERT` is the second idempotency boundary: the partial unique index on `sessions.(parent_session_id, retry_count)` rejects duplicate child rows even if two workers claim the same queue row through a PostgreSQL bug, replication lag, or operational error.
 
 The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent.
 
-**Failure mode of the retry handler itself.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If `BackgroundTaskManager.start_retriable()` fails to enqueue, the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The handler must not raise out of `handle_session_failure`; an unhandled exception in an event handler can stall the dispatcher.
+**Failure mode of the retry decision.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If the queue `INSERT` fails (DB unavailable), the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The decision must not raise out of `handle_batch_result`; an unhandled exception there would also break existing batch-result bookkeeping.
+
+**Failure mode of the dispatcher worker.** If `create_from_params` raises after the queue row is claimed, the worker stamps `dispatched_at` with a sentinel value and emits `session.retry_exhausted` carrying the underlying error. If the manager restarts while a row is claimed but not yet dispatched, the worker re-claims, on startup, any row whose `claimed_at` is older than a configurable lease (e.g., 5 minutes).
 
 **No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely.
 
 ### API surface
 
@@ -231,7 +251,7 @@ A retry attempt is a fresh `SessionRow` and counts against the user's concurrent
 
 ### Operational kill switch
 
-The cluster-level etcd default doubles as a kill switch: setting `config/manager/retry_policy_default` to `{max_retries: 0}` disables retries globally without redeploying the manager. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above).
+`config/manager/retry_disabled` (etcd, boolean) is the cluster-level kill switch — see "Defaults and kill switch" above.
Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). ## Implementation Plan @@ -239,12 +259,13 @@ Six PRs, each tracked by its own sub-issue under #11320: 1. **BEP draft** (this document) — #11321. 2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. -3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for the retry chain. -4. **Retry engine:** decision integration in the termination-event path, `SessionService.create_from_params` extension to inherit retry context, defaults precedence (project/domain/etcd), counters/events/audit. -5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. -6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. +3. **Schema:** Alembic migration adding `SessionRow` retry columns (with the partial unique index) and the `session_retry_dispatch_queue` table; repository read/write for the retry chain. +4. **Retry decision:** fold the decision into `handle_batch_result` in `SessionEventHandler`, queue insert with idempotency, etcd kill switch and cluster default, counters/events/audit. +5. **Dispatcher worker:** periodic claim loop on `session_retry_dispatch_queue`, `SessionService.create_from_params` extension to inherit retry context, lease-based recovery on manager restart. +6. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. +7. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. -Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism — ship with the retry-engine PR. Estimated effort: three to four weeks for one engineer. +Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism, manager-restart recovery of claimed-but-undispatched queue rows — ship with the dispatcher-worker PR. Estimated effort: four to five weeks for one engineer. ## References From 4dc297c47ebe1218a6e8b7fd0dcac75e8f476d1d Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:33:38 +0900 Subject: [PATCH 8/9] docs(BA-5851): close fourth-pass review gaps in BEP-1053 Three concrete implementation gaps from the latest hostile review: 1) Kill switch read pattern: changed from "read etcd at top of decision flow" (hot-path etcd read on every batch failure) to "loaded at startup, refreshed via existing EtcdConfigWatcher (config/provider.py: 20)." Also extends the kill-switch check to the dispatcher worker before claiming a queue row, so flipping the switch mid-incident halts in-flight queued retries. 2) Queue claim deadlock: pinned the SQL to single-row claim using "FOR UPDATE SKIP LOCKED LIMIT 1" inside an UPDATE-from-SELECT. Multiple manager replicas can now claim disjoint rows without contending on the same lock. Sentinel value for failed dispatch ('1970-01-01' timestamptz) made explicit. 3) classify_failure malformed-input fallback: explicitly does NOT default to UNKNOWN when status_data is missing or has missing required keys; instead returns a never-retriable sentinel and logs a WARNING. Only well-formed failures with unrecognized error.name map to UNKNOWN. Prevents retry storms from serialization bugs. 
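A condensed sketch of (3) — illustrative names only; the real enum and the
name-to-cause mapping live in the BEP body:

    import logging
    from enum import Enum

    log = logging.getLogger(__name__)

    class Cause(Enum):
        OOM_KILLED = "oom_killed"
        UNKNOWN = "unknown"
        NEVER_RETRIABLE = "never_retriable"   # hardcoded sentinel

    NAME_TO_CAUSE = {"OutOfMemoryError": Cause.OOM_KILLED}  # illustrative

    def classify_failure(session_id, status_data) -> Cause:
        error = (status_data or {}).get("error")
        if not isinstance(error, dict) or not {"name", "src"} <= error.keys():
            # Malformed envelope: permanent, never UNKNOWN, so a
            # serialization bug cannot fan out into a retry storm.
            log.warning("malformed status_data for session %s", session_id)
            return Cause.NEVER_RETRIABLE
        return NAME_TO_CAUSE.get(error["name"], Cause.UNKNOWN)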
Also: dispatcher worker now has a concrete proposed location (sokovan/scheduler/retry_dispatcher.py). Reviewer also confirmed (fourth pass) no duplication with existing queue/outbox patterns: SessionDependencyRow exists but is for kernel deps, not scheduling. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 24 +++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index bc57f38dd66..d1f1cd1dddc 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -153,7 +153,7 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": - **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. -- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Checked at the **top** of the decision flow, before any policy merge. When `true`, no retries are scheduled regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). +- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Loaded at startup and refreshed via the existing `EtcdConfigWatcher` (`manager/config/provider.py:20`) so changes propagate without per-event etcd reads. Checked at the **top** of the decision flow, before any policy merge, **and** by the dispatcher worker before claiming a queue row — so flipping the switch mid-incident also halts in-flight queued retries. When `true`, no retries are scheduled or dispatched regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). **Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. @@ -195,12 +195,30 @@ The retry decision is **folded into the existing `SessionEventHandler.handle_bat A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. -**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (a sokovan-style worker — same cadence as existing periodic tasks) claims rows where `scheduled_at <= now() AND claimed_at IS NULL` via `UPDATE ... 
SET claimed_at = now() RETURNING ...` (atomic claim under PostgreSQL's row lock) and invokes `SessionService.create_from_params()`. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. +**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (placed under `sokovan/` alongside other periodic workers, e.g., `sokovan/scheduler/retry_dispatcher.py`) claims **one row at a time** via: + +```sql +UPDATE session_retry_dispatch_queue +SET claimed_at = now() +WHERE (parent_session_id, retry_count) = ( + SELECT parent_session_id, retry_count + FROM session_retry_dispatch_queue + WHERE scheduled_at <= now() + AND claimed_at IS NULL + AND dispatched_at IS NULL + ORDER BY scheduled_at + FOR UPDATE SKIP LOCKED + LIMIT 1 +) +RETURNING parent_session_id, retry_count; +``` + +`FOR UPDATE SKIP LOCKED` lets multiple manager replicas claim disjoint rows without contention, and `LIMIT 1` avoids multi-row claim deadlocks. The worker invokes `SessionService.create_from_params()` for the claimed row and stamps `dispatched_at = now()` on success or `dispatched_at = '1970-01-01'::timestamptz` (sentinel) on failure. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): 1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. **Malformed-input fallback:** if `status_data` is `None`, `status_data["error"]` is missing, or required keys (`name`, `src`) are missing, `classify_failure` does **not** return `UNKNOWN`. Instead it logs a WARNING and returns the hardcoded never-retriable sentinel — a malformed error envelope is treated as a permanent failure to avoid retry storms on serialization bugs. Only well-formed failures with an unrecognized `error.name` map to `UNKNOWN`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. 3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. 4. Compute `delay` per the formula above. 5. 
`INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique-violation and the `INSERT` is skipped. Emit `session.retry_scheduled`. From 7f679935f5748385d5cc8ab491acb858f24f728c Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 13:37:49 +0900 Subject: [PATCH 9/9] docs(BA-5851): split BEP-1053 into two-tier batch resilience design Reviewer feedback (paraphrased): - "max_retries-style closed enum on the manager side breaks extensibility (seen this fail before with hardcoded runtime classification)." - "Most batch retry should live on the agent." - "Resource/node-level failures (OOM, disconnect) should be rescheduled to a different node, not retried in place; don't mutate resource allocation." The original BEP-1053 stacked all of this into a single per-session RetryPolicy + queue + child sessions. Pivot to two narrower BEPs that ship independently: BEP-1053 (re-scoped): "Agent-level Batch Retry" - batch_retries / batch_retry_delay knobs on session creation - agent re-runs the entrypoint inside the same kernel - no manager-side state, no new tables, no new events - ~100 lines, smallest possible delta on Agent.execute_batch BEP-1054 (new): "Session Rescheduling on Terminal Failure" - new RescheduleFailedBatchSessionsLifecycleHandler under sokovan - reuses phase_attempts (no new counter), SERVICE_MAX_RETRIES (now made configurable per scaling group, closes its FIXME) - extends the existing expired -> PENDING transition pattern to fire from terminal-failure with a node-level cause - failure classification is etcd pattern config (extensible), not a closed enum in code - same SessionRow, same allocation; no parent_session_id, no child sessions, no queue table The two BEPs compose: agent-side script retries first; if all attempts fail and the cause is node-level, the scheduler reschedules to a fresh node; on the new node, agent-side retries run again. Each attempt is recorded in scheduling history. Registry updated, news fragment rewritten. Pivot rationale captured at docs/investigation/bep-1053-design-pivot.md. 
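For intuition on how the two tiers compose (illustrative numbers, not
shipped defaults):

    batch_retries = 2    # BEP-1053: extra in-place runs per scheduled node
    node_attempts = 3    # BEP-1054: schedule attempts, capped by SERVICE_MAX_RETRIES
    print((1 + batch_retries) * node_attempts)   # 9 entrypoint runs, worst case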
Co-Authored-By: Claude Opus 4.7 (1M context) --- changes/11322.doc.md | 2 +- proposals/BEP-1053-agent-batch-retry.md | 131 ++++++++ proposals/BEP-1053-native-session-retry.md | 293 ------------------ ...ession-rescheduling-on-terminal-failure.md | 169 ++++++++++ proposals/README.md | 3 +- 5 files changed, 303 insertions(+), 295 deletions(-) create mode 100644 proposals/BEP-1053-agent-batch-retry.md delete mode 100644 proposals/BEP-1053-native-session-retry.md create mode 100644 proposals/BEP-1054-session-rescheduling-on-terminal-failure.md diff --git a/changes/11322.doc.md b/changes/11322.doc.md index 4c246d3f07e..8c3f2ced091 100644 --- a/changes/11322.doc.md +++ b/changes/11322.doc.md @@ -1 +1 @@ -Add BEP-1053 proposing native session-level retry for batch sessions, with a `RetryPolicy` schema modeled after Apache Airflow and adapted to Backend.AI's event-driven model +Add BEP-1053 (agent-level batch entrypoint retry) and BEP-1054 (session rescheduling on terminal failure) covering the two-tier batch resilience design — in-script retry stays on the agent; node-level failures reschedule the same session through the existing scheduler lifecycle handlers diff --git a/proposals/BEP-1053-agent-batch-retry.md b/proposals/BEP-1053-agent-batch-retry.md new file mode 100644 index 00000000000..91074f8620f --- /dev/null +++ b/proposals/BEP-1053-agent-batch-retry.md @@ -0,0 +1,131 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Agent-level Batch Retry + +## Related Issues + +- JIRA: BA-5851 +- GitHub Epic: #11320 +- GitHub: #11321 +- Companion BEP: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md) + +## Motivation + +When a `BATCH` session's entrypoint exits non-zero, the session is marked failed and the user must manually re-submit. Most batch failures in practice are transient (a flaky network call, a downstream service hiccup, an intermittent dependency error) and a simple in-place re-run would have succeeded. Today the user pays the cost of re-creating the session — re-scheduling, re-pulling the image, re-mounting volumes — for a problem that is purely inside the script. + +This BEP adds a small **agent-side** knob: re-run the batch entrypoint inside the same kernel up to N times before reporting failure. It is the simpler, smaller half of the batch-retry feature; the companion BEP-1054 covers the case where the failure is at the *node* level and a fresh schedule is needed. + +### Goals + +- Opt-in retry of the batch entrypoint inside an existing kernel. +- No new manager-side state, tables, or events. +- Default `batch_retries = 0` keeps current behavior. +- Per-session knob; no policy framework needed at this layer. + +### Non-goals + +- Failures before the kernel is running (image pull, scheduling). Those go to BEP-1054. +- OOM and node-level failures. Re-running on the same node typically does not help; BEP-1054 handles them by rescheduling. +- A user-supplied retry-policy DSL with backoff and classification. Out of scope for v1; if needed, accrue evidence first and design separately. + +## Current Design + +The agent runs batch entrypoints in `Agent.execute_batch()` (`src/ai/backend/agent/agent.py:2406`). The path: + +1. Kernel reaches the running state. +2. 
If `kernel_obj.session_type == SessionTypes.BATCH` (`agent.py:2274`), the agent enqueues `execute_batch(session_id, kernel_id, startup_command, batch_timeout)` into `_ongoing_exec_batch_tasks` (line 840).
+3. `execute_batch` invokes the kernel runner via `kernel.execute(...)` once.
+4. On a non-zero exit code (or timeout), the agent emits `SessionFailureAnycastEvent` and `SessionFailureBroadcastEvent` (lines 2375, 2389, 2464, 2478, 2492).
+5. On success, it emits `SessionSuccessAnycastEvent`/`SessionSuccessBroadcastEvent`.
+
+There is no in-script retry — the entrypoint runs exactly once per session. `RestartTracker` (line 757) handles *kernel* restart on agent crash recovery, not script re-execution.
+
+## Proposed Design
+
+### Knob
+
+Two new fields on the batch session creation request, plumbed through the existing kernel-config path that already carries `startup_command` and `batch_timeout`:
+
+| Field | Type | Default | Meaning |
+|---|---|---|---|
+| `batch_retries` | int (≥ 0) | `0` | Maximum number of additional `execute_batch` attempts after the first. Total attempts = `1 + batch_retries`. |
+| `batch_retry_delay` | float seconds (≥ 0) | `0.0` | Wait between attempts. Constant; no backoff at this layer. |
+
+The two fields sit alongside `startup_command`, `bootstrap_script`, and `batch_timeout` in the session creation DTO. They are batch-only — the agent ignores them when `session_type != SessionTypes.BATCH`.
+
+### Execution loop
+
+`execute_batch` becomes:
+
+```python
+async def execute_batch(self, session_id, kernel_id, startup_command, batch_timeout,
+                        batch_retries: int = 0, batch_retry_delay: float = 0.0):
+    last_exit_code: int | None = None
+    for attempt in range(batch_retries + 1):
+        if attempt > 0:
+            log.info("execute_batch(k:{}) retry attempt {}/{}", kernel_id, attempt, batch_retries)
+            await asyncio.sleep(batch_retry_delay)
+        last_exit_code = await self._run_batch_once(session_id, kernel_id, startup_command, batch_timeout)
+        if last_exit_code == 0:
+            await self._emit_session_success(session_id, kernel_id)
+            return
+        # else: non-zero exit -> retry if attempts remain
+    # exhausted
+    await self._emit_session_failure(session_id, kernel_id, last_exit_code)
+```
+
+Only **non-zero exit codes** trigger a retry. Cancellation, timeout, and infrastructure errors (kernel disconnect, container crash) do **not** loop here:
+- Cancellation propagates as today.
+- Timeout (`KernelLifecycleEventReason.TASK_TIMEOUT`, `agent.py:2492`) emits failure as today; rerunning a script that already ran past `batch_timeout` is unhelpful.
+- Container-level failures escalate to BEP-1054's domain.
+
+### Observability
+
+- `bai_agent_batch_retry_attempted_total{session_type=batch}` counter (per attempt beyond the first).
+- `bai_agent_batch_retry_succeeded_total` counter (incremented when a retry attempt exits zero).
+- `bai_agent_batch_retry_exhausted_total` counter (incremented when the loop ends with non-zero).
+- Each retry attempt logged at INFO with `(kernel_id, attempt, max_attempts)`.
+- The existing failure event is emitted only on final exhaustion; no new event types.
+
+### What does **not** change
+
+- Session lifecycle, statuses, or transitions.
+- Manager-side handlers (`SessionEventHandler`, sokovan).
+- Database schema.
+- `creation_id`, `parent_session_id` (does not exist), retry chain (does not exist).
+- API surface beyond the two new fields on the create request.
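+
+To make the loop semantics above concrete — only non-zero exits consume attempts, so `batch_retries = 2` allows at most three runs — here is a self-contained sketch of the same control flow (illustration only, not the agent's actual code):
+
+```python
+import asyncio
+
+async def run_with_batch_retries(run_once, batch_retries=0, batch_retry_delay=0.0):
+    # Mirrors the execute_batch loop: constant delay, retry only on non-zero exit.
+    last_exit_code = None
+    for attempt in range(batch_retries + 1):
+        if attempt > 0:
+            await asyncio.sleep(batch_retry_delay)
+        last_exit_code = await run_once()
+        if last_exit_code == 0:
+            return ("success", attempt + 1)    # attempts actually used
+    return ("failure", last_exit_code)
+
+async def main():
+    outcomes = iter([1, 1, 0])                 # fail, fail, then succeed
+    async def run_once():
+        return next(outcomes)
+    assert await run_with_batch_retries(run_once, batch_retries=2) == ("success", 3)
+
+asyncio.run(main())
+```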
+
+The only manager-side change is plumbing `batch_retries` and `batch_retry_delay` from the create request into the kernel config payload that the agent already receives.
+
+## Migration / Compatibility
+
+- Default `batch_retries = 0` preserves current behavior for every existing caller.
+- New fields are additive on the create request and on responses (echoed back for visibility).
+- No Alembic migration required.
+- Operators have a per-session opt-out by leaving the field unset; no global kill switch needed because the feature is opt-in.
+
+## Implementation Plan
+
+Three PRs:
+
+1. **BEP draft** (this document) plus the companion BEP-1054 — #11321.
+2. **Agent change:** extend `execute_batch` with the retry loop, plumb `batch_retries`/`batch_retry_delay` from kernel config, add metrics, unit tests around the loop semantics.
+3. **Client surface:** SDK v2 + CLI v2 accept the two new fields on `./bai session create -t batch`. REST v2 / GraphQL v2 echo them on session info responses.
+
+Tests live with the code under test. The agent's batch executor has existing test scaffolding; the loop is the smallest possible delta.
+
+Estimated effort: under one week for one engineer, given the constrained scope.
+
+## References
+
+- Companion: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md)
+- Working draft of the prior single-BEP design and the pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
+- Apache Airflow's `retries` parameter (the inspirational reference): `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
+- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md
deleted file mode 100644
index d1f1cd1dddc..00000000000
--- a/proposals/BEP-1053-native-session-retry.md
+++ /dev/null
@@ -1,293 +0,0 @@
----
-Author: Jeongseok Kang (jskang@lablup.com)
-Status: Draft
-Created: 2026-04-27
-Created-Version: 26.5.0
-Target-Version:
-Implemented-Version:
----
-
-# Native Session Retry
-
-## Related Issues
-
-- JIRA: BA-5851
-- GitHub Epic: #11320
-- GitHub: #11321
-
-## Motivation
-
-Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py:execute_with_txn_retry`), kernel restart on the agent (`agent.py:RestartTracker`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec."
-
-The retry concern is therefore pushed to every higher-level orchestrator on top of Backend.AI, each of which re-implements the same logic with inconsistent semantics. Lifting retry into core gives one source of truth, resilience for plain batch workloads, and lets orchestrators thin out their own retry layers.
-
-### Goals
-
-- Opt-in automatic retry for `BATCH` sessions with a `RetryPolicy` accepted at session creation.
-- Each retry is a fresh session linked to its parent — no kernel reuse, no new status state.
-- Default `max_retries=0` keeps current behavior intact.
-- A single user-facing knob: setting `max_retries > 0` retries on any non-permanent failure.
- -## Current Design - -### Session lifecycle - -`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle: - -``` -PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED -``` - -`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) is unrelated to this BEP: it tells the scheduler which **startup** states are still safe to re-dispatch *within the same session*. This BEP introduces a separate concept — re-creating a fresh session after the previous one has gone terminal. - -### Session creation path - -``` -POST /v2/sessions - → CreateFromParamsAction - → SessionService.create_from_params (services/session/service.py:255) - → repository → SessionRow (models/session/row.py:384) -``` - -`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; today it is generated as `secrets.token_urlsafe(16)` (`services/session/service.py:1593`). It is **not** extended to encode retry chains — those use a separate first-class column (see Data Model). - -There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. - -### Termination event handling - -`SessionEventHandler` (`event_dispatcher/handlers/session.py:52`) already subscribes to the relevant events: - -| Method | Event | Line | -|---|---|---| -| `handle_session_started` | `SessionStartedAnycastEvent` | 88 | -| `handle_session_cancelled` | `SessionFailureAnycastEvent` | 105 | -| `handle_session_terminating` | `SessionTerminatingAnycastEvent` | 118 | -| `handle_session_terminated` | `SessionTerminatedAnycastEvent` | 130 | - -`handle_session_terminated` already consults `session.status_data["error"]` for endpoint-route bookkeeping, so the failure metadata needed for retry classification is already on hand at this point. What is missing is the decision: "should we spawn a child session?" - -No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status transitions) covers in-session retries by the scheduler, not session re-creation. - -## Proposed Design - -### Mental model - -`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. Classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. - -### `RetryPolicy` schema - -A Pydantic DTO at `src/ai/backend/common/dto/manager/v2/session/retry_policy.py`, matching the v2 DTO location used by other recent BEPs. Per `src/ai/backend/manager/data/CLAUDE.md`, `data/` is reserved for frozen dataclasses with no framework deps; Pydantic models live under `common/dto/` so they can be shared across REST v2 and GraphQL. 
Schema modeled on Airflow's parameter surface: - -```python -class BackoffStrategy(StrEnum): - FIXED = "fixed" - EXPONENTIAL = "exponential" - -class JitterMode(StrEnum): - NONE = "none" - DETERMINISTIC = "deterministic" - RANDOM = "random" - -class RetryEligibleCause(StrEnum): - AGENT_TRANSIENT = "agent_transient" - SCHEDULER_TIMEOUT = "scheduler_timeout" - IMAGE_PULL_FAILURE = "image_pull_failure" - KERNEL_NONZERO_EXIT = "kernel_nonzero_exit" - OOM_KILLED = "oom_killed" - UNKNOWN = "unknown" - - @classmethod - def defaults(cls) -> frozenset["RetryEligibleCause"]: - return frozenset(cls) - -class RetryPolicy(BaseModel): - max_retries: NonNegativeInt = 0 - retry_delay: PositiveFloat = 60.0 - backoff: BackoffStrategy = BackoffStrategy.FIXED - backoff_multiplier: PositiveFloat = 2.0 - max_retry_delay: PositiveFloat | None = 3600.0 - jitter: JitterMode = JitterMode.DETERMINISTIC - jitter_ratio: confloat(ge=0, le=1) = 0.25 - eligible_causes: frozenset[RetryEligibleCause] = Field( - default_factory=RetryEligibleCause.defaults - ) - emit_retry_events: bool = True -``` - -Notable deviations from Airflow: - -- **No callback parameter.** Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events instead of registering an `on_retry_callback`. Keeps the policy serializable and the server's behavior fully auditable. -- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process; classification reads `status_data` instead. -- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI naming and the existing pipeline orchestrator on top of Backend.AI. - -### Failure classification - -A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded never-retriable causes live outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. - -| Cause | In default eligible set | Notes | -|---|---|---| -| `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | -| `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | -| `IMAGE_PULL_FAILURE` | yes | Typo wastes a few seconds with backoff; registry blip is real. | -| `KERNEL_NONZERO_EXIT` | yes | The most common reason batch users want retry. | -| `OOM_KILLED` | yes | Retry without resource bump usually fails again, but exhausting `max_retries` is cheap. | -| `UNKNOWN` | yes | Conservative for unclassified failures. | -| `USER_CANCELLED` | hardcoded never | Permanent. | -| `VALIDATION_ERROR` / `QUOTA_EXCEEDED` | hardcoded never | Permanent. | - -### Backoff formula - -``` -base = retry_delay if backoff == FIXED - min(retry_delay * backoff_multiplier ** retry_count, otherwise - max_retry_delay or MAX_RETRY_DELAY) -delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, - seed=(session_id, retry_count)) -delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) -``` - -`MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. - -### Defaults and kill switch - -Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": - -- **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). 
This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. -- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Loaded at startup and refreshed via the existing `EtcdConfigWatcher` (`manager/config/provider.py:20`) so changes propagate without per-event etcd reads. Checked at the **top** of the decision flow, before any policy merge, **and** by the dispatcher worker before claiming a queue row — so flipping the switch mid-incident also halts in-flight queued retries. When `true`, no retries are scheduled or dispatched regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). - -**Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. - -### Data model - -One Alembic migration adds to `sessions`: - -| Column | Type | Description | -|---|---|---| -| `parent_session_id` | `UUID NULL` | Self-FK to `sessions.id`; null for the first attempt. | -| `retry_count` | `INT NOT NULL DEFAULT 0` | 0 for the first attempt. | -| `max_retries` | `INT NOT NULL DEFAULT 0` | Denormalized from policy for cheap filters. | -| `retry_policy` | `JSONB NULL` | Full policy. | -| `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | - -The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` is unchanged and remains a per-attempt random token (no retry encoding). - -A second small table `session_retry_dispatch_queue` is added for durable delayed dispatch (see "Decision and dispatch"): - -| Column | Type | Description | -|---|---|---| -| `parent_session_id` | `UUID NOT NULL` | FK to `sessions.id`. | -| `retry_count` | `INT NOT NULL` | Target attempt number (= parent.retry_count + 1). | -| `scheduled_at` | `TIMESTAMPTZ NOT NULL` | Earliest dispatch time. | -| `claimed_at` | `TIMESTAMPTZ NULL` | Set when a dispatcher worker claims the row. | -| `dispatched_at` | `TIMESTAMPTZ NULL` | Set when the child session has been created. | - -Primary key `(parent_session_id, retry_count)` — the same constraint that protects `sessions` also serializes queue inserts. The queue lets retry decisions survive manager restarts, mirrors the durable-outbox pattern used by the existing pipeline orchestrator on top of Backend.AI, and avoids inventing in-memory scheduling. - -`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. 
The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. - -### Decision and dispatch - -The retry decision is **folded into the existing `SessionEventHandler.handle_batch_result`** (`event_dispatcher/handlers/session.py:152`), not added as a sibling handler. Rationale: - -- `SessionFailureAnycastEvent` is **already** subscribed by `handle_batch_result` (`event_dispatcher/dispatch.py:520`). Adding a second handler on the same event would race against bookkeeping work (`set_session_result`, etc.) and depend on undefined dispatch ordering. -- Failure metadata (`session.status_data["error"]`) is already loaded in `handle_batch_result` for the existing failure path; the retry decision can reuse it without new DB roundtrips. -- The handler runs after the session has reached a terminal status, so parent state is settled, and the change does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). - -A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. - -**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (placed under `sokovan/` alongside other periodic workers, e.g., `sokovan/scheduler/retry_dispatcher.py`) claims **one row at a time** via: - -```sql -UPDATE session_retry_dispatch_queue -SET claimed_at = now() -WHERE (parent_session_id, retry_count) = ( - SELECT parent_session_id, retry_count - FROM session_retry_dispatch_queue - WHERE scheduled_at <= now() - AND claimed_at IS NULL - AND dispatched_at IS NULL - ORDER BY scheduled_at - FOR UPDATE SKIP LOCKED - LIMIT 1 -) -RETURNING parent_session_id, retry_count; -``` - -`FOR UPDATE SKIP LOCKED` lets multiple manager replicas claim disjoint rows without contention, and `LIMIT 1` avoids multi-row claim deadlocks. The worker invokes `SessionService.create_from_params()` for the claimed row and stamps `dispatched_at = now()` on success or `dispatched_at = '1970-01-01'::timestamptz` (sentinel) on failure. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. - -The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): - -1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. 
**Malformed-input fallback:** if `status_data` is `None`, `status_data["error"]` is missing, or required keys (`name`, `src`) are missing, `classify_failure` does **not** return `UNKNOWN`. Instead it logs a WARNING and returns the hardcoded never-retriable sentinel — a malformed error envelope is treated as a permanent failure to avoid retry storms on serialization bugs. Only well-formed failures with an unrecognized `error.name` map to `UNKNOWN`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. -4. Compute `delay` per the formula above. -5. `INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique-violation and the `INSERT` is skipped. Emit `session.retry_scheduled`. -6. The dispatcher worker eventually claims the row, runs `SessionService.create_from_params` with a `CreateFromParamsAction` derived from the parent, and stamps `dispatched_at`. The child `INSERT` is the second idempotency boundary: the partial unique index on `sessions.(parent_session_id, retry_count)` rejects duplicate child rows even if two workers claim the same queue row through PG bug, replication lag, or operational error. - -The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent. - -**Failure mode of the retry decision.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If the queue `INSERT` fails (DB unavailable), the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The decision must not raise out of `handle_batch_result`; an unhandled exception there would also break existing batch-result bookkeeping. - -**Failure mode of the dispatcher worker.** If `create_from_params` raises after the queue row is claimed, the worker stamps `dispatched_at` to a sentinel value and emits `session.retry_exhausted` with the underlying error. Manager restart while a row is claimed-but-not-dispatched: the worker re-claims rows whose `claimed_at` is older than a configurable lease (e.g., 5 minutes) on startup. - -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely. - -### API surface - -REST v2 (`api/rest/v2/sessions/`): - -| Method | Path | Purpose | -|---|---|---| -| `POST` | `/sessions` | Accept optional `retry_policy` in `SessionCreateRequest`. | -| `GET` | `/sessions/{id}` | Return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). | -| `GET` | `/sessions/{id}/attempts` | Return the chain with the status of each attempt. 
| - -GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, `retryChain` resolver. - -Client SDK v2 + CLI v2: expose the new fields; `./bai session info` shows `attempt N of M` and links to the parent. - -No retry mutation in v1; manual retry is deferred until the auto path stabilizes. - -### Observability - -- Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. -- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin, replacing the role of Airflow's `on_retry_callback` for downstream consumers. -- Audit log entry per retry dispatch: cause and attempt N of M. - -## Migration / Compatibility - -- Default `max_retries=0` keeps behavior unchanged for every existing caller. -- All new columns are nullable or default to safe zero values; the Alembic migration is purely additive. -- Existing GraphQL and REST clients continue to work; new fields are additive on responses. -- Operators opt in by setting the cluster default in etcd or a per-session policy. -- External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. -- No breaking changes. - -### Quota and accounting - -A retry attempt is a fresh `SessionRow` and counts against the user's concurrent-session limit while it is alive — same as if the user had re-submitted manually. The previous attempt's resource consumption is not refunded; this matches the principle that "actual GPU/CPU time was spent, regardless of why the session ended." The API exposes the chain so accounting tools can group attempts under one logical job if they choose. - -### Operational kill switch - -`config/manager/retry_disabled` (etcd, boolean) is the cluster-level kill switch — see "Defaults and kill switch" above. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). - -## Implementation Plan - -Six PRs, each tracked by its own sub-issue under #11320: - -1. **BEP draft** (this document) — #11321. -2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. -3. **Schema:** Alembic migration adding `SessionRow` retry columns (with the partial unique index) and the `session_retry_dispatch_queue` table; repository read/write for the retry chain. -4. **Retry decision:** fold the decision into `handle_batch_result` in `SessionEventHandler`, queue insert with idempotency, etcd kill switch and cluster default, counters/events/audit. -5. **Dispatcher worker:** periodic claim loop on `session_retry_dispatch_queue`, `SessionService.create_from_params` extension to inherit retry context, lease-based recovery on manager restart. -6. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. -7. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. - -Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism, manager-restart recovery of claimed-but-undispatched queue rows — ship with the dispatcher-worker PR. Estimated effort: four to five weeks for one engineer. 
## References

- Working draft: `docs/investigation/native-session-retry-plan.md`
- Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
- Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md`

diff --git a/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md
new file mode 100644
index 00000000000..44a07e5d60b
--- /dev/null
+++ b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md
@@ -0,0 +1,169 @@
---
Author: Jeongseok Kang (jskang@lablup.com)
Status: Draft
Created: 2026-04-27
Created-Version: 26.5.0
Target-Version:
Implemented-Version:
---

# Session Rescheduling on Terminal Failure

## Related Issues

- JIRA: BA-5851
- GitHub Epic: #11320
- GitHub: #11321
- Companion BEP: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)

## Motivation

Some session failures are **node-level**: the kernel was OOM-killed on this host, the agent disconnected mid-run, the registry route used by this scaling group is briefly down, the network namespace setup failed for a node-specific reason. For these cases, re-running the script in place — Backend.AI's existing scheduler-internal retries, or BEP-1053's agent-level batch retry — does not help. What does help is **rescheduling the same session to a different node**, with the same resource allocation.

Today, terminal-failure sessions stay terminal. There is no path that takes a session in `ERROR` and pushes it back through the scheduler. Operators have to ask users to re-create their sessions, often after diagnosing that the failure was the host's fault, not the user's. This BEP closes that gap.

It is the companion to [BEP-1053](BEP-1053-agent-batch-retry.md), which handles in-script retry; together they cover the two distinct retry surfaces. They are designed to ship independently.

### Goals

- Re-dispatch a terminal-failed `BATCH` session through the scheduler when the failure is classified as **node-level**.
- Reuse existing scheduler infrastructure: `SessionLifecycleHandler`, `phase_attempts`, scheduling history, the `expired → PENDING` transition pattern.
- Make failure classification **operator-extensible** — etcd-driven pattern config, not a closed enum in code.
- Promote the standing `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`) to a real configuration knob as a side effect.
- Default off; opt-in per scaling group.

### Non-goals

- Mutating resource allocation (no "give it more memory and retry"). Resource decisions stay with the user/admin.
- User-facing per-session `RetryPolicy` with backoff/jitter/max. Rescheduling is operator-policy, not user-policy.
- Interactive or inference sessions. INTERACTIVE is user-driven; INFERENCE has BEP-1049 deployment-route handling.
- Re-running the user script in place. That is BEP-1053's job.

## Current Design

### Session lifecycle and terminal status

`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle. `terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out today. `retriable_statuses()` (line 118) is the scheduler's *in-session* retriable set; it does not apply to sessions already in `ERROR`.
### Sokovan lifecycle handlers

Periodic `SessionLifecycleHandler`s drive scheduler decisions (`sokovan/scheduler/handlers/`). Each declares `success / need_retry / expired / give_up` outcomes and the status transitions for each (`base.py:62-93`). Existing handlers include `CheckPreconditionLifecycleHandler` and `StartSessionsLifecycleHandler`, which use the **`expired → PENDING`** transition pattern (`check_precondition.py:67`, `start_sessions.py:78`) — the canonical "re-schedule this session" mechanism, scoped today to startup-stage timeouts.

### Existing counters and caps

- `phase_attempts` (`sokovan/data/lifecycle.py:322`): per-session attempt counter sourced from scheduling history (`coordinator.py:756`). Documented as "give_up when >= max_retries."
- `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`): the global cap, used by both the session and deployment coordinators (`coordinator.py:1228`, `deployment/coordinator.py:764`).

### Failure metadata

When a session fails, `SessionRow.status_data` carries `{"error": {"name": ..., "src": ...}}` per `manager/exceptions.py:convert_to_status_data` and the `ErrorStatusInfo` / `ErrorDetail` TypedDicts (line 97). The shape is stable.

### What is missing

A handler that fires on **terminal-failure** sessions, classifies the failure, and either reschedules or accepts the failure. Today's handlers run on non-terminal sessions only.

## Proposed Design

### A new lifecycle handler: `RescheduleFailedBatchSessionsLifecycleHandler`

Lives at `sokovan/scheduler/handlers/lifecycle/reschedule_failed_batch.py`, alongside the existing handlers. Targets sessions where:

- `session_type == SessionTypes.BATCH`
- `status == ERROR`
- `phase_attempts < effective_max_retries`
- `status_data["error"]` classifies as a *reschedulable* cause (see "Failure classification" below).

Outcomes:

- **`success`** (rescheduling fired): transition `ERROR → PENDING`. Re-uses the existing `expired → PENDING` machinery, just from a new starting status. Increments `phase_attempts` via the standard scheduling-history append.
- **`give_up`** (cap reached, or cause not reschedulable): no transition. The session stays in `ERROR`.
- **`need_retry`** (transient inability to act, e.g., DB contention): no transition; the handler retries next cycle.

The handler reuses **everything** the existing lifecycle handlers reuse: `phase_attempts` from scheduling history is the counter, `SERVICE_MAX_RETRIES` (now configurable, see below) is the cap, and the lifecycle-coordinator path applies the transition. No new column on `SessionRow`. No queue table. No child sessions.

### Same session, not a child

A reschedule keeps the original `SessionRow` — same `id`, same `creation_id`, same kernel records, same resource allocation. The session re-enters `PENDING` with `phase_attempts` incremented; the scheduler picks a new agent on the next dispatch cycle. The kernels associated with the previous attempt are cleaned up as part of the terminal-state transition that already runs today.

This is intentionally different from the original BEP-1053 draft: there are no parent-child rows, no retry chain, no `parent_session_id`. The "history" of attempts is what scheduling history already records.
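Putting the targeting rules and outcomes together, a minimal sketch of the per-session decision follows. The handler-class wiring is omitted, the names `decide_reschedule` and `Decision` are hypothetical, `phase_attempts` is assumed to have been resolved onto the session view by the coordinator, and `SessionTypes` / `SessionStatus` are the enums cited in "Current Design". The classifier it consults is specified in the next subsection; the transient `need_retry` path is left out for brevity.

```python
from enum import StrEnum

class Decision(StrEnum):   # hypothetical helper, not the real outcome type
    SUCCESS = "success"    # fire the ERROR -> PENDING transition
    GIVE_UP = "give_up"    # leave the session in ERROR
    SKIP = "skip"          # not a target of this handler at all

def decide_reschedule(session, classifier, effective_max_retries: int) -> Decision:
    # Target only terminal-failed batch sessions.
    if session.session_type != SessionTypes.BATCH or session.status != SessionStatus.ERROR:
        return Decision.SKIP
    # The cap check reuses phase_attempts sourced from scheduling history.
    if session.phase_attempts >= effective_max_retries:
        return Decision.GIVE_UP
    error = (session.status_data or {}).get("error") or {}
    action = classifier.classify(error)  # pattern config, next subsection
    return Decision.SUCCESS if action == "reschedule" else Decision.GIVE_UP
```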
### Failure classification — extensible, not closed

A closed enum of causes hardcodes runtime behavior into code; site-specific failure signatures (vendor accelerator faults, registry-specific image-pull errors, custom-plugin failures) cannot be classified without a manager release. Replace the closed enum with a **pattern-based config**, loaded from etcd and refreshed via `EtcdConfigWatcher` (`manager/config/provider.py:20`):

```yaml
# config/manager/session_failure_classification
default: give_up
by_error_name:
  OOMError: reschedule
  AgentDisconnected: reschedule
  ImagePullError: give_up # agent's tenacity already retried
  HeartbeatTimeout: reschedule
  ValidationError: give_up
  QuotaExceededError: give_up
by_error_src:
  agent: reschedule # fallback for agent-side errors not named above
```

Resolution order: `by_error_name` (most specific) → `by_error_src` → `default`. The result is one of three closed `Action` values: `reschedule`, `give_up`, or `ignore` (do not handle yet — leave the session for a later cycle; rarely used). A resolver sketch follows the Backward compatibility notes below.

The **action catalog** stays a closed enum (the manager has to know what each action means), but the **cause catalog** is open: operators add patterns without code changes.

Hardcoded never-reschedulable causes: `USER_CANCELLED` (user intent), and any cause that originates *after* the session reached `RUNNING` and the user's script started — those are BEP-1053's domain. The handler short-circuits on these regardless of config.

### `SERVICE_MAX_RETRIES` becomes configurable

A new key under the same etcd config tree: `config/manager/scheduler_max_retries`. Read at startup, refreshed via `EtcdConfigWatcher`. Per-scaling-group overrides live under `config/scaling-groups/{sg_name}/scheduler_max_retries`. The default is `5` (matching the current constant). The handler resolves the cap from scaling-group config first, then cluster config, then the default. This closes the standing `FIXME: make configurable`.

### Kill switch

`config/manager/reschedule_disabled` (etcd boolean, default `false`). Loaded at startup, watched. Checked at the top of the handler's per-cycle execution; when `true`, the handler is a no-op for that cycle. Useful for incident response (e.g., stop rescheduling cluster-wide during a cascade).

### Observability

- Counters: `bai_session_reschedule_attempted_total{cause}`, `bai_session_reschedule_capped_total{cause}` (cap reached), `bai_session_reschedule_succeeded_total` (a subsequent attempt reached `RUNNING`).
- Event: `session.rescheduled`, emitted when the `ERROR → PENDING` transition fires. Reuses the existing event-publication path from the lifecycle coordinator.
- Audit log entry per reschedule: `(session_id, cause, attempt N of M, source_agent, target_after = scheduler_choice)` (the new agent is whichever node the scheduler picks on the next dispatch cycle).
- The existing scheduling-history rows already record per-attempt timestamps and outcomes; that is the durable trail.

## Migration / Compatibility

### Backward compatibility

- The default classification config produces no `reschedule` actions for any cause, so even with the default `reschedule_disabled = false`, **the feature is effectively off until an operator populates the classification config** — zero behavior change on rollout.
- All etcd keys are additive; no existing key changes shape.
- No Alembic migration is required.
- The `SERVICE_MAX_RETRIES` constant in `manager/defs.py:121` remains as the default when the etcd key is absent. The `FIXME` is closed; the constant becomes a fallback.
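As promised above, a minimal sketch of the pattern-based resolver, assuming the config dict has already been fetched from the etcd key and parsed from YAML. The class name `FailureClassifier` matches the PR-2 foundation item in the Implementation Plan, but the method surface shown here is illustrative:

```python
from enum import StrEnum

class Action(StrEnum):
    RESCHEDULE = "reschedule"
    GIVE_UP = "give_up"
    IGNORE = "ignore"  # leave the session for a later cycle

class FailureClassifier:
    """Resolution order: by_error_name -> by_error_src -> default."""

    def __init__(self, config: dict) -> None:
        self._by_name = config.get("by_error_name") or {}
        self._by_src = config.get("by_error_src") or {}
        self._default = Action(config.get("default", "give_up"))

    def classify(self, error: dict) -> Action:
        # `error` is the status_data["error"] envelope: {"name": ..., "src": ...}.
        name, src = error.get("name"), error.get("src")
        if name in self._by_name:
            return Action(self._by_name[name])
        if src in self._by_src:
            return Action(self._by_src[src])
        return self._default

# With the YAML example above, an agent-side fault not named explicitly still
# falls through to the by_error_src rule:
#   classifier.classify({"name": "VendorXpuFault", "src": "agent"})
#   -> Action.RESCHEDULE
```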
### Quota and accounting

A reschedule does not create a new `SessionRow`, so concurrent-session limits are unaffected. Resource consumption from the previous attempt is not refunded — the user *did* consume those resources on the failed node — but the next attempt re-uses the same allocation request, so quota is not double-counted.

### Interaction with BEP-1053

The two BEPs are designed to compose:

- **BEP-1053** runs first, inside the failing kernel: on a non-zero exit it re-runs the script, and only if all attempts fail does the agent emit `SessionFailureAnycastEvent`.
- **BEP-1054** then evaluates the resulting terminal-failure session. If the cause is node-level, the scheduler reschedules. If the cause is "user script failed after all in-place retries," the classification config maps it to `give_up` and the session stays terminal.

A session can therefore experience: agent-side script retries → manager-side reschedule → agent-side script retries again on a new node. Each attempt is recorded in scheduling history; users see one logical job, operators see the full trail.

## Implementation Plan

The BEP draft plus five implementation PRs, each tracked under #11320:

1. **BEP draft** (this document and the companion BEP-1053) — #11321.
2. **Foundation:** the `FailureClassifier` (pattern-based, etcd-driven, refreshed via `EtcdConfigWatcher`; see the sketch in the classification section above) and the `Action` enum. Pure logic, unit-test heavy.
3. **`SERVICE_MAX_RETRIES` configurability:** etcd source + per-scaling-group override + fallback to the `defs.py` constant. Closes the standing FIXME.
4. **Lifecycle handler:** `RescheduleFailedBatchSessionsLifecycleHandler`, the kill switch, the `ERROR → PENDING` transition (extending the existing pattern to a new starting status), counters/events/audit.
5. **API surface:** session info responses include `reschedule_count` (a view over `phase_attempts`) and the latest `reschedule_cause`. No mutation; this is read-only observability.
6. **Client:** SDK v2 + CLI v2 surface the new info fields; user docs.

Tests live with the code under test. Cross-cutting integration tests — node-level failure → reschedule → success on a different agent; cap reached → terminal; classification config empty → terminal; kill switch on → no rescheduling — ship with the lifecycle-handler PR. Estimated effort: two to three weeks for one engineer.

## References

- Companion: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)
- Working draft and design pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
- [BEP-1049: Zero-Downtime Deployment Strategy Architecture](BEP-1049-deployment-strategy-handler.md) — analogous handler pattern for routes

diff --git a/proposals/README.md b/proposals/README.md
index f93b0b63f5b..590085e2753 100644
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -123,7 +123,8 @@ BEP numbers start from 1000.
 | [1050](BEP-1050-prometheus-query-preset-system.md) | Prometheus Query Preset System | BoKeum Kim | Draft |
 | [1051](BEP-1051-kata-containers-agent.md) | Kata Containers Agent Backend | Kyujin Cho | Draft |
 | [1052](BEP-1052-scoped-app-config-redesign.md) | Scoped App Config Redesign | Gyubong Lee | Draft |
-| [1053](BEP-1053-native-session-retry.md) | Native Session Retry | Jeongseok Kang | Draft |
+| [1053](BEP-1053-agent-batch-retry.md) | Agent-level Batch Retry | Jeongseok Kang | Draft |
+| [1054](BEP-1054-session-rescheduling-on-terminal-failure.md) | Session Rescheduling on Terminal Failure | Jeongseok Kang | Draft |
 | _next_ | _(reserve your number here)_ | | |

 ## File Structure