From d1360ffe3a35be66bfdf749cb975677760c2959b Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 10:17:45 +0900 Subject: [PATCH 1/9] docs: add BEP-1053 native session retry proposal Captures the design for adding native session-level retry to Backend.AI core, modeled after Apache Airflow's RetryPolicy parameter surface and adapted to Backend.AI's event-driven architecture. Subsequent implementation work tracked under epic. Refs: #11320, #11321 Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 263 +++++++++++++++++++++ proposals/README.md | 1 + 2 files changed, 264 insertions(+) create mode 100644 proposals/BEP-1053-native-session-retry.md diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md new file mode 100644 index 00000000000..597634863e0 --- /dev/null +++ b/proposals/BEP-1053-native-session-retry.md @@ -0,0 +1,263 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Native Session Retry + +## Related Issues + +- GitHub Epic: #11320 +- GitHub: #11321 + +## Motivation + +Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. + +The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py`), kernel restart on the agent (`agent/agent.py:restarting_kernels`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec." + +This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives: + +- A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller. +- Resilience for plain batch workloads without requiring an external orchestrator. +- Reduced duplication; orchestrators above Backend.AI can thin out their retry layers. + +## Current Design + +Session statuses are defined in `src/ai/backend/manager/data/session/types.py:30-51`: + +``` +PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED +``` + +Terminal statuses with no further transitions: `ERROR`, `TERMINATED`, `CANCELLED`. `SessionStatus.retriable_statuses()` (line 118) classifies which startup states are scheduling-retriable, but there is no notion of *re-creating* a terminal `ERROR` session. + +Session creation flows through `API handler → SessionService.create_from_params() → repository → SessionRow`. `SessionRow.creation_id` already exists as an idempotency key. There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy. + +The termination event handler (`event_dispatcher/handlers/session.py`) listens to `session.terminated` / `session.error` but has no retry decision hook. + +No prior BEP covers session retry or fault tolerance. + +## Proposed Design + +### Mental model + +`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. 
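+
+To make the stance concrete, here is a minimal sketch of the intended decision predicate — illustrative only; `RetryPolicy` and `RetryEligibleCause` are defined in the next section, and the helper below is an assumption of this sketch, not final code:
+
+```python
+# Hardcoded permanent failures live outside the user-facing enum (sketch).
+NEVER_RETRIABLE = frozenset({"user_cancelled", "validation_error", "quota_exceeded"})
+
+def should_retry(cause: str, retry_count: int, policy: "RetryPolicy") -> bool:
+    # Hard exclusions come first: permanent failures ignore user policy entirely.
+    if cause in NEVER_RETRIABLE:
+        return False
+    # The single user-facing knob: are attempts remaining?
+    if retry_count >= policy.max_retries:
+        return False
+    # Eligibility only excludes; the default set covers every ordinary failure.
+    return cause in {c.value for c in policy.eligible_causes}
+```
+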
The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. + +### `RetryPolicy` schema + +A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface: + +```python +class BackoffStrategy(StrEnum): + FIXED = "fixed" + EXPONENTIAL = "exponential" + +class JitterMode(StrEnum): + NONE = "none" + DETERMINISTIC = "deterministic" + RANDOM = "random" + +class RetryEligibleCause(StrEnum): + AGENT_TRANSIENT = "agent_transient" + SCHEDULER_TIMEOUT = "scheduler_timeout" + IMAGE_PULL_FAILURE = "image_pull_failure" + KERNEL_NONZERO_EXIT = "kernel_nonzero_exit" + OOM_KILLED = "oom_killed" + UNKNOWN = "unknown" + + @classmethod + def defaults(cls) -> frozenset["RetryEligibleCause"]: + return frozenset({ + cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT, + cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT, + cls.OOM_KILLED, cls.UNKNOWN, + }) + +class RetryPolicy(BaseModel): + max_retries: NonNegativeInt = 0 + retry_delay: PositiveFloat = 60.0 + backoff: BackoffStrategy = BackoffStrategy.FIXED + backoff_multiplier: PositiveFloat = 2.0 + max_retry_delay: PositiveFloat | None = 3600.0 + jitter: JitterMode = JitterMode.DETERMINISTIC + jitter_ratio: confloat(ge=0, le=1) = 0.25 + eligible_causes: frozenset[RetryEligibleCause] = Field( + default_factory=RetryEligibleCause.defaults + ) + emit_retry_events: bool = True +``` + +Mapping to Airflow: + +| Airflow | `RetryPolicy` | +|---|---| +| `retries` | `max_retries` (count, total attempts = `1 + max_retries`) | +| `retry_delay` | `retry_delay` (seconds) | +| `retry_exponential_backoff` (multiplier) | `backoff: fixed\|exponential` + `backoff_multiplier` | +| `max_retry_delay` (with 24 h hard ceiling) | `max_retry_delay` (24 h hard ceiling preserved) | +| SHA1-deterministic jitter | `jitter` (selectable: none / deterministic / random), `jitter_ratio` | +| Exception-typed eligibility | Structural enum `RetryEligibleCause` | +| `on_retry_callback` | `session.retry_scheduled` / `session.retry_exhausted` events | +| `default_args` precedence | Per-session > project/domain default > etcd cluster default | +| `email_on_retry` | Subsumed by event subscription via webhook plugin | + +Deviations from Airflow and their reasons: + +- **No callback parameter.** Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events. +- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. +- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI conventions and the existing pipeline orchestrator. + +### Failure classification + +A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded non-retriable causes outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. + +| Cause | Default eligible | Notes | +|---|---|---| +| `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | +| `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | +| `IMAGE_PULL_FAILURE` | yes | Typo wastes a few seconds with backoff; registry blip is real. | +| `KERNEL_NONZERO_EXIT` | yes | The most common reason batch users want retry. | +| `OOM_KILLED` | yes | Retry without resource bump usually fails again, but exhausting `max_retries` is cheap. 
| +| `UNKNOWN` | yes | Conservative for unclassified failures. | +| `USER_CANCELLED` | hardcoded never | Permanent. | +| `VALIDATION_ERROR` / `QUOTA_EXCEEDED` | hardcoded never | Permanent. | + +### Backoff formula + +``` +base = retry_delay if backoff == FIXED + min(retry_delay * backoff_multiplier ** retry_count, otherwise + max_retry_delay or MAX_RETRY_DELAY) +delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, + seed=(session_id, retry_count)) +delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) +``` + +`MAX_RETRY_DELAY` is a hard 24 h ceiling. Deterministic jitter takes `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. + +### Defaults precedence + +Three layers, matching Airflow's `default_args` propagation: + +1. Per-session policy in the create request. +2. Project / domain default (new optional field, admin-managed). +3. Cluster default in etcd: `config/manager/retry_policy_default`. Ship default: `max_retries=0` → no behavior change. + +Effective policy = deep-merge top-down; per-session wins. + +### Data model + +One Alembic migration adds to `sessions`: + +``` +parent_session_id : UUID NULL (self-FK) +retry_count : INT NOT NULL DEFAULT 0 +max_retries : INT NOT NULL DEFAULT 0 +retry_policy : JSONB NULL +retry_cause : TEXT NULL +``` + +Rationale: `parent_session_id`, `retry_count`, `max_retries` are first-class columns because they are queried for filters and joins. The rest live in JSONB. **No new history table** — the chain is a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. Cheaper than a separate history table and consistent with Backend.AI's existing model. + +### Decision and dispatch + +A new handler at `event_dispatcher/handlers/session_retry.py` subscribes to `session.terminated` / `session.error`: + +1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. +2. Classify failure. If cause not in `eligible_causes` (or in hardcoded never-retry set) → return. +3. Acquire row lock with `select_for_update()`. If a child with deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists → return (idempotency). +4. Compute `delay` per the formula above. +5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. + +The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. + +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine. + +### API surface + +REST v2 (`api/rest/v2/sessions/`): + +- `POST /sessions` — accept optional `retry_policy` in the request body. +- `GET /sessions/{id}` — return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). +- `GET /sessions/{id}/attempts` — return the chain with status of each attempt. 
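+
+For illustration, a create request carrying the policy might look as follows — a sketch only; the session fields are abbreviated, and the exact request schema is defined by the v2 API, not by this example:
+
+```python
+import json
+
+create_request = {
+    "name": "train-batch-01",   # abbreviated; other session fields omitted
+    "type": "batch",
+    # New in this proposal: optional policy, deep-merged over cluster defaults.
+    "retry_policy": {
+        "max_retries": 3,
+        "retry_delay": 60.0,
+        "backoff": "exponential",
+        "backoff_multiplier": 2.0,
+        "jitter": "deterministic",
+    },
+}
+print(json.dumps(create_request, indent=2))
+```
+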
+ +GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, resolver `retryChain`. + +Client SDK v2 + CLI v2: expose new fields; `./bai session info` shows `attempt N of M` and links to the parent. + +**No retry mutation in v1.** Manual retry is deferred until the auto path is stable. + +### Observability + +- Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. +- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin. Replace the role of Airflow's `on_retry_callback` for downstream consumers. +- Audit log entry per retry dispatch (auto, cause, attempt N of M). + +## Migration / Compatibility + +### Backward compatibility + +- Default `max_retries=0` ⇒ zero behavior change for existing callers. +- All new columns are nullable or default to safe zero values. +- Existing GraphQL and REST clients continue to work; new fields are additive. + +### Migration steps + +1. Apply Alembic migration adding the five columns. Migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. +2. Deploy manager with retry handler and surface, default off via etcd. +3. Operators opt in by setting cluster default or per-session policy. +4. External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. + +### Breaking changes + +None. + +## Implementation Plan + +Six PRs, each tracked by its own sub-issue under #11320: + +1. **BEP draft** (this document) — #11321. +2. **Foundation:** `RetryPolicy` DTO, `classify_failure` module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy. +3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for retry chain. Backportable. +4. **Retry engine:** event handler, `SessionService.create_from_params` extension, defaults precedence (project/domain/etcd), counters/events/audit. +5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. +6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. + +Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR. + +Estimated effort: three to four weeks for one engineer. + +## Decision Log + +| Date | Decision | Rationale | +|------|----------|-----------| +| 2026-04-27 | Batch sessions only in v1 | Interactive sessions are user-driven and do not fit auto-retry semantics. | +| 2026-04-27 | Each retry is a fresh session, linked via `parent_session_id` | Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail. | +| 2026-04-27 | No new `RETRYING` status | Parent goes to `ERROR`, child starts `PENDING` — avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients. | +| 2026-04-27 | Linked-list chain, not a separate history table | The chain is already a list of real `SessionRow`s; no need to duplicate. | +| 2026-04-27 | Structural `RetryEligibleCause` enum, not exception-typed | Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. 
| +| 2026-04-27 | `KERNEL_NONZERO_EXIT` is in the default eligible set | `max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model. | +| 2026-04-27 | `USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable | These are permanent by definition; users cannot opt them into retry. | +| 2026-04-27 | No retry mutation in v1 | Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision. | +| 2026-04-27 | Idempotency via deterministic child `creation_id` | Reuses an existing field; no new uniqueness constraint required. | +| 2026-04-27 | Deterministic jitter seed = `(session_id, retry_count)` | Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry. | + +## Open Questions + +- Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call. +- Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred. +- Manual retry in v2: counts toward `max_retries` or independent? Decide before exposing. +- Default for `max_retry_delay`: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry. +- Project/domain defaults table location: extend an existing table or add a small new `project_retry_defaults` table? + +## References + +- Working draft: `docs/investigation/native-session-retry-plan.md` +- Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159` +- Existing scheduler state-machine BEP: [BEP-1030](BEP-1030-sokovan-scheduler-status-transition.md) +- Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md` diff --git a/proposals/README.md b/proposals/README.md index b0024efe64e..f93b0b63f5b 100644 --- a/proposals/README.md +++ b/proposals/README.md @@ -123,6 +123,7 @@ BEP numbers start from 1000. 
| [1050](BEP-1050-prometheus-query-preset-system.md) | Prometheus Query Preset System | BoKeum Kim | Draft | | [1051](BEP-1051-kata-containers-agent.md) | Kata Containers Agent Backend | Kyujin Cho | Draft | | [1052](BEP-1052-scoped-app-config-redesign.md) | Scoped App Config Redesign | Gyubong Lee | Draft | +| [1053](BEP-1053-native-session-retry.md) | Native Session Retry | Jeongseok Kang | Draft | | _next_ | _(reserve your number here)_ | | | ## File Structure From ed3c61a3a2dfb6f2f9c47caa6a484eda69dee528 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 10:21:01 +0900 Subject: [PATCH 2/9] docs(BA-5851): add news fragment for BEP-1053 Co-Authored-By: Claude Opus 4.7 (1M context) --- changes/11322.doc.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 changes/11322.doc.md diff --git a/changes/11322.doc.md b/changes/11322.doc.md new file mode 100644 index 00000000000..4c246d3f07e --- /dev/null +++ b/changes/11322.doc.md @@ -0,0 +1 @@ +Add BEP-1053 proposing native session-level retry for batch sessions, with a `RetryPolicy` schema modeled after Apache Airflow and adapted to Backend.AI's event-driven model From 1404ebdd161675e4eebede223c2b80f9081dc881 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:10:02 +0900 Subject: [PATCH 3/9] docs(BA-5851): refresh BEP-1053 against current main Verified code paths against latest main: - SessionStatus enum at data/session/types.py:30-50; terminal_statuses() at line 109 - SessionEventHandler at event_dispatcher/handlers/session.py:52, with handlers for started/cancelled/terminating/terminated; status_data["error"] already consulted in handle_session_terminated - SessionService.create_from_params at services/session/service.py:255 - SessionRow.creation_id at models/session/row.py:389-390 Drop redundant tables (Decision Log, Airflow-mapping) and the Open Questions section to match the format used by recent BEPs (BEP-1049, BEP-1050). Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 198 +++++++++------------ 1 file changed, 89 insertions(+), 109 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 597634863e0..bc40ec329cd 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -11,46 +11,72 @@ Implemented-Version: ## Related Issues +- JIRA: BA-5851 - GitHub Epic: #11320 - GitHub: #11321 ## Motivation -Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. +Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py:execute_with_txn_retry`), kernel restart on the agent (`agent.py:RestartTracker`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec." -The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py`), kernel restart on the agent (`agent/agent.py:restarting_kernels`), and `tenacity`-wrapped HTTP/socket retries. 
None of these handle "the session as a whole failed; create a fresh one with the same spec." +The retry concern is therefore pushed to every higher-level orchestrator on top of Backend.AI, each of which re-implements the same logic with inconsistent semantics. Lifting retry into core gives one source of truth, resilience for plain batch workloads, and lets orchestrators thin out their own retry layers. -This pushes the retry concern out to every higher-level orchestrator on top of Backend.AI. Each one re-implements the same logic, with inconsistent semantics. Pushing retry into core gives: +### Goals -- A single source of truth for retry semantics — backoff, jitter, eligibility — shared by every caller. -- Resilience for plain batch workloads without requiring an external orchestrator. -- Reduced duplication; orchestrators above Backend.AI can thin out their retry layers. +- Opt-in automatic retry for `BATCH` sessions with a `RetryPolicy` accepted at session creation. +- Each retry is a fresh session linked to its parent — no kernel reuse, no new status state. +- Default `max_retries=0` keeps current behavior intact. +- A single user-facing knob: setting `max_retries > 0` retries on any non-permanent failure. ## Current Design -Session statuses are defined in `src/ai/backend/manager/data/session/types.py:30-51`: +### Session lifecycle + +`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle: ``` PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED ``` -Terminal statuses with no further transitions: `ERROR`, `TERMINATED`, `CANCELLED`. `SessionStatus.retriable_statuses()` (line 118) classifies which startup states are scheduling-retriable, but there is no notion of *re-creating* a terminal `ERROR` session. +`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) classifies which startup states the **scheduler** considers retriable for re-dispatch within the same session, but there is no concept of *re-creating* a session that has already gone terminal. + +### Session creation path + +``` +POST /v2/sessions + → CreateFromParamsAction + → SessionService.create_from_params (services/session/service.py:255) + → repository → SessionRow (models/session/row.py:384) +``` + +`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; we can extend it to also key retry attempts. + +There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. -Session creation flows through `API handler → SessionService.create_from_params() → repository → SessionRow`. `SessionRow.creation_id` already exists as an idempotency key. There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy. +### Termination event handling -The termination event handler (`event_dispatcher/handlers/session.py`) listens to `session.terminated` / `session.error` but has no retry decision hook. +`SessionEventHandler` (`event_dispatcher/handlers/session.py:52`) already subscribes to the relevant events: -No prior BEP covers session retry or fault tolerance. 
+| Method | Event | Line | +|---|---|---| +| `handle_session_started` | `SessionStartedAnycastEvent` | 88 | +| `handle_session_cancelled` | `SessionFailureAnycastEvent` | 105 | +| `handle_session_terminating` | `SessionTerminatingAnycastEvent` | 118 | +| `handle_session_terminated` | `SessionTerminatedAnycastEvent` | 130 | + +`handle_session_terminated` already consults `session.status_data["error"]` for endpoint-route bookkeeping, so the failure metadata needed for retry classification is already on hand at this point. What is missing is the decision: "should we spawn a child session?" + +No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status transitions) covers in-session retries by the scheduler, not session re-creation. ## Proposed Design ### Mental model -`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. The classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. +`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. Classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. ### `RetryPolicy` schema -A Pydantic DTO accepted at session creation, modeled on Airflow's parameter surface: +A Pydantic DTO at `common/dto/manager/v2/session/retry_policy.py` (per the manager `data/` layer rule that Pydantic models live under `dto/`, not `data/`). Schema modeled on Airflow's parameter surface: ```python class BackoffStrategy(StrEnum): @@ -72,11 +98,7 @@ class RetryEligibleCause(StrEnum): @classmethod def defaults(cls) -> frozenset["RetryEligibleCause"]: - return frozenset({ - cls.AGENT_TRANSIENT, cls.SCHEDULER_TIMEOUT, - cls.IMAGE_PULL_FAILURE, cls.KERNEL_NONZERO_EXIT, - cls.OOM_KILLED, cls.UNKNOWN, - }) + return frozenset(cls) class RetryPolicy(BaseModel): max_retries: NonNegativeInt = 0 @@ -92,31 +114,17 @@ class RetryPolicy(BaseModel): emit_retry_events: bool = True ``` -Mapping to Airflow: - -| Airflow | `RetryPolicy` | -|---|---| -| `retries` | `max_retries` (count, total attempts = `1 + max_retries`) | -| `retry_delay` | `retry_delay` (seconds) | -| `retry_exponential_backoff` (multiplier) | `backoff: fixed\|exponential` + `backoff_multiplier` | -| `max_retry_delay` (with 24 h hard ceiling) | `max_retry_delay` (24 h hard ceiling preserved) | -| SHA1-deterministic jitter | `jitter` (selectable: none / deterministic / random), `jitter_ratio` | -| Exception-typed eligibility | Structural enum `RetryEligibleCause` | -| `on_retry_callback` | `session.retry_scheduled` / `session.retry_exhausted` events | -| `default_args` precedence | Per-session > project/domain default > etcd cluster default | -| `email_on_retry` | Subsumed by event subscription via webhook plugin | +Notable deviations from Airflow: -Deviations from Airflow and their reasons: - -- **No callback parameter.** Keeps the policy serializable and the server's behavior auditable. Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events. -- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. 
-- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI conventions and the existing pipeline orchestrator. +- **No callback parameter.** Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events instead of registering an `on_retry_callback`. Keeps the policy serializable and the server's behavior fully auditable. +- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process; classification reads `status_data` instead. +- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI naming and the existing pipeline orchestrator on top of Backend.AI. ### Failure classification -A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded non-retriable causes outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. +A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded never-retriable causes live outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. -| Cause | Default eligible | Notes | +| Cause | In default eligible set | Notes | |---|---|---| | `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | | `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | @@ -138,126 +146,98 @@ delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) ``` -`MAX_RETRY_DELAY` is a hard 24 h ceiling. Deterministic jitter takes `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. +`MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. ### Defaults precedence Three layers, matching Airflow's `default_args` propagation: 1. Per-session policy in the create request. -2. Project / domain default (new optional field, admin-managed). -3. Cluster default in etcd: `config/manager/retry_policy_default`. Ship default: `max_retries=0` → no behavior change. +2. Project / domain default (new optional field on the project config; admin-managed). +3. Cluster default in etcd: `config/manager/retry_policy_default`. -Effective policy = deep-merge top-down; per-session wins. +Effective policy = deep-merge top-down; per-session wins. Ship default at layer 3 is `max_retries=0`. ### Data model One Alembic migration adds to `sessions`: -``` -parent_session_id : UUID NULL (self-FK) -retry_count : INT NOT NULL DEFAULT 0 -max_retries : INT NOT NULL DEFAULT 0 -retry_policy : JSONB NULL -retry_cause : TEXT NULL -``` +| Column | Type | Description | +|---|---|---| +| `parent_session_id` | `UUID NULL` | Self-FK to `sessions.id`; null for the first attempt. | +| `retry_count` | `INT NOT NULL DEFAULT 0` | 0 for the first attempt. | +| `max_retries` | `INT NOT NULL DEFAULT 0` | Denormalized from policy for cheap filters. | +| `retry_policy` | `JSONB NULL` | Full policy. | +| `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. 
| -Rationale: `parent_session_id`, `retry_count`, `max_retries` are first-class columns because they are queried for filters and joins. The rest live in JSONB. **No new history table** — the chain is a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. Cheaper than a separate history table and consistent with Backend.AI's existing model. +`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters and joins; the rest live in JSONB. **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. ### Decision and dispatch -A new handler at `event_dispatcher/handlers/session_retry.py` subscribes to `session.terminated` / `session.error`: +The retry decision is added to the existing termination-event path. Two integration points are equivalent in correctness; the implementation PR will pick one: + +- **Extend `SessionEventHandler`** in `event_dispatcher/handlers/session.py` with a `handle_session_failure` method (or fold the decision into `handle_session_terminated`), since failure metadata is already read there for endpoint-route bookkeeping. +- **Add a sokovan post-processor** under `sokovan/scheduler/post_processors/`, invoked when the scheduler observes a session entering a terminal failure state. + +The decision flow is the same regardless: 1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. -2. Classify failure. If cause not in `eligible_causes` (or in hardcoded never-retry set) → return. -3. Acquire row lock with `select_for_update()`. If a child with deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists → return (idempotency). +2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +3. Acquire a row lock with `select_for_update()`. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). 4. Compute `delay` per the formula above. 5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. -The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, resource_slots, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. +The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, `resource_slots`, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" / "this session has a pending child." This avoids touching the scheduler state machine. +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` field on the API tells clients "attempt N of M" or "this session has a pending child." 
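+
+A sketch of how the computed field could be resolved — illustrative names, not the final resolver:
+
+```python
+def resolve_retry_state(retry_count: int, max_retries: int, has_pending_child: bool) -> dict:
+    # 1-based for humans: retry_count == 0 renders as "attempt 1 of M".
+    return {
+        "attempt": retry_count + 1,
+        "of": 1 + max_retries,  # total attempts = 1 + max_retries
+        "has_pending_child": has_pending_child,
+        "exhausted": retry_count >= max_retries and not has_pending_child,
+    }
+```
+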
This avoids touching the scheduler state machine. ### API surface REST v2 (`api/rest/v2/sessions/`): -- `POST /sessions` — accept optional `retry_policy` in the request body. -- `GET /sessions/{id}` — return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). -- `GET /sessions/{id}/attempts` — return the chain with status of each attempt. +| Method | Path | Purpose | +|---|---|---| +| `POST` | `/sessions` | Accept optional `retry_policy` in `SessionCreateRequest`. | +| `GET` | `/sessions/{id}` | Return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). | +| `GET` | `/sessions/{id}/attempts` | Return the chain with the status of each attempt. | -GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, resolver `retryChain`. +GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, `retryChain` resolver. -Client SDK v2 + CLI v2: expose new fields; `./bai session info` shows `attempt N of M` and links to the parent. +Client SDK v2 + CLI v2: expose the new fields; `./bai session info` shows `attempt N of M` and links to the parent. -**No retry mutation in v1.** Manual retry is deferred until the auto path is stable. +No retry mutation in v1; manual retry is deferred until the auto path stabilizes. ### Observability - Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. -- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin. Replace the role of Airflow's `on_retry_callback` for downstream consumers. -- Audit log entry per retry dispatch (auto, cause, attempt N of M). +- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin, replacing the role of Airflow's `on_retry_callback` for downstream consumers. +- Audit log entry per retry dispatch: cause and attempt N of M. ## Migration / Compatibility -### Backward compatibility - -- Default `max_retries=0` ⇒ zero behavior change for existing callers. -- All new columns are nullable or default to safe zero values. -- Existing GraphQL and REST clients continue to work; new fields are additive. - -### Migration steps - -1. Apply Alembic migration adding the five columns. Migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. -2. Deploy manager with retry handler and surface, default off via etcd. -3. Operators opt in by setting cluster default or per-session policy. -4. External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. - -### Breaking changes - -None. +- Default `max_retries=0` keeps behavior unchanged for every existing caller. +- All new columns are nullable or default to safe zero values; the Alembic migration is purely additive. +- Existing GraphQL and REST clients continue to work; new fields are additive on responses. +- Operators opt in by setting the cluster default in etcd or a per-session policy. +- External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. +- No breaking changes. ## Implementation Plan Six PRs, each tracked by its own sub-issue under #11320: 1. **BEP draft** (this document) — #11321. -2. 
**Foundation:** `RetryPolicy` DTO, `classify_failure` module, backoff utility (with deterministic jitter). Pure functions, no I/O, unit-test heavy. -3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for retry chain. Backportable. -4. **Retry engine:** event handler, `SessionService.create_from_params` extension, defaults precedence (project/domain/etcd), counters/events/audit. +2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. +3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for the retry chain. +4. **Retry engine:** decision integration in the termination-event path, `SessionService.create_from_params` extension to inherit retry context, defaults precedence (project/domain/etcd), counters/events/audit. 5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. 6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. -Tests live with the code under test. Cross-cutting integration tests (transient → retry → success; exhaustion path; concurrent dispatch idempotency; jitter determinism) ship with the retry-engine PR. - -Estimated effort: three to four weeks for one engineer. - -## Decision Log - -| Date | Decision | Rationale | -|------|----------|-----------| -| 2026-04-27 | Batch sessions only in v1 | Interactive sessions are user-driven and do not fit auto-retry semantics. | -| 2026-04-27 | Each retry is a fresh session, linked via `parent_session_id` | Matches existing pipeline orchestrator semantics; avoids reusing kernels/scratch and the complexity that would entail. | -| 2026-04-27 | No new `RETRYING` status | Parent goes to `ERROR`, child starts `PENDING` — avoids touching the scheduler state machine. Computed `retry_state` on the API is enough for clients. | -| 2026-04-27 | Linked-list chain, not a separate history table | The chain is already a list of real `SessionRow`s; no need to duplicate. | -| 2026-04-27 | Structural `RetryEligibleCause` enum, not exception-typed | Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process. | -| 2026-04-27 | `KERNEL_NONZERO_EXIT` is in the default eligible set | `max_retries > 0` should be the only knob a typical user touches; matches Airflow's "retry on failure, period" model. | -| 2026-04-27 | `USER_CANCELLED` / `VALIDATION_ERROR` / `QUOTA_EXCEEDED` are hardcoded non-retriable | These are permanent by definition; users cannot opt them into retry. | -| 2026-04-27 | No retry mutation in v1 | Auto path stabilizes first; manual retry's interaction with `max_retries` is itself a design decision. | -| 2026-04-27 | Idempotency via deterministic child `creation_id` | Reuses an existing field; no new uniqueness constraint required. | -| 2026-04-27 | Deterministic jitter seed = `(session_id, retry_count)` | Reproducible for tests; trade-off vs. unpredictability is acceptable for a server-side retry. | - -## Open Questions - -- Quota accounting: do retries count against concurrent-session limits? Likely yes, but needs a product call. -- Retry-storm kill switch: should the etcd default be a single boolean toggle, a rate limit, or both? Leaning toward a boolean for v1 with a rate limit deferred. -- Manual retry in v2: counts toward `max_retries` or independent? Decide before exposing. 
-- Default for `max_retry_delay`: 1 h is conservative for long-running batch jobs that might benefit from a longer cooldown after repeated failures. Revisit after telemetry. -- Project/domain defaults table location: extend an existing table or add a small new `project_retry_defaults` table? +Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism — ship with the retry-engine PR. Estimated effort: three to four weeks for one engineer. ## References - Working draft: `docs/investigation/native-session-retry-plan.md` - Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159` -- Existing scheduler state-machine BEP: [BEP-1030](BEP-1030-sokovan-scheduler-status-transition.md) +- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md) - Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md` From 8305ac04a2a0a076316ea5cad9fff60cc04ffa25 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:11:32 +0900 Subject: [PATCH 4/9] docs(BA-5851): use SQLAlchemy with_for_update in BEP-1053 Replace the Django-style select_for_update() reference with SQLAlchemy 2.x syntax matching the existing pattern in repositories/agent/db_source.py and repositories/deployment/db_source.py: sa.select(SessionRow).where(...) .with_for_update() inside begin_session(). Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index bc40ec329cd..52fbc71fb8b 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -183,7 +183,7 @@ The decision flow is the same regardless: 1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Acquire a row lock with `select_for_update()`. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). +3. Lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` inside the session repository's `begin_session()` transaction. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). 4. Compute `delay` per the formula above. 5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. From f0794157d7239fd6bddba6fa6a55ca91756985d3 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:16:47 +0900 Subject: [PATCH 5/9] docs(BA-5851): tighten BEP-1053 idempotency, dispatch, and dependencies Address production-readiness review findings: - Add partial unique index on (parent_session_id, retry_count) as the real idempotency guarantee for retry dispatch. The parent row lock alone is insufficient because creation_id has unique=False (models/session/row.py:390); under concurrent handlers the second INSERT would have succeeded silently. The unique index makes duplicate child creation a hard failure. 
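  In SQLAlchemy terms the index is roughly (a sketch; the index name is
  assumed):
    sa.Index("ix_sessions_parent_retry", "parent_session_id", "retry_count",
             unique=True,
             postgresql_where=sa.text("parent_session_id IS NOT NULL"))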
- Commit to extending SessionEventHandler in event_dispatcher/handlers/ session.py for the retry decision, instead of leaving "two equivalent integration points." Sokovan post-processors run during scheduling iterations, which complicates idempotency without adding capability. Also notes the recent sokovan refactor (#11250 / 8321c79aa) so future readers see why the post-processor path was rejected. - Specifically name BackgroundTaskManager.start_retriable() (already injected into SessionService at service.py:245,408) as the dispatch primitive instead of vague "background task / event mechanism." - Defer project/domain default layer to a follow-up after BEP-1052 (Scoped App Config Redesign) lands, so this BEP doesn't conflict with in-flight config-surface work. - Document the retry handler's own failure modes (classify_failure raise, start_retriable enqueue failure) and the accounting policy (each retry counts against quota; no refund). - Clarify retriable_statuses() is unrelated (in-session re-dispatch, not session re-creation), point DTO location reference to the manager data/CLAUDE.md rule, mark retry_state as an API-layer resolver. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 45 ++++++++++++++-------- 1 file changed, 28 insertions(+), 17 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 52fbc71fb8b..83dbff823c0 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -38,7 +38,7 @@ The retry concern is therefore pushed to every higher-level orchestrator on top PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED ``` -`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) classifies which startup states the **scheduler** considers retriable for re-dispatch within the same session, but there is no concept of *re-creating* a session that has already gone terminal. +`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) is unrelated to this BEP: it tells the scheduler which **startup** states are still safe to re-dispatch *within the same session*. This BEP introduces a separate concept — re-creating a fresh session after the previous one has gone terminal. ### Session creation path @@ -76,7 +76,7 @@ No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status ### `RetryPolicy` schema -A Pydantic DTO at `common/dto/manager/v2/session/retry_policy.py` (per the manager `data/` layer rule that Pydantic models live under `dto/`, not `data/`). Schema modeled on Airflow's parameter surface: +A Pydantic DTO at `src/ai/backend/common/dto/manager/v2/session/retry_policy.py`, matching the v2 DTO location used by other recent BEPs. Per `src/ai/backend/manager/data/CLAUDE.md`, `data/` is reserved for frozen dataclasses with no framework deps; Pydantic models live under `common/dto/` so they can be shared across REST v2 and GraphQL. Schema modeled on Airflow's parameter surface: ```python class BackoffStrategy(StrEnum): @@ -150,13 +150,14 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) ### Defaults precedence -Three layers, matching Airflow's `default_args` propagation: +Two layers in v1, matching Airflow's `default_args` spirit while staying compatible with parallel work on the config surface: 1. 
Per-session policy in the create request. -2. Project / domain default (new optional field on the project config; admin-managed). -3. Cluster default in etcd: `config/manager/retry_policy_default`. +2. Cluster default in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). -Effective policy = deep-merge top-down; per-session wins. Ship default at layer 3 is `max_retries=0`. +Effective policy = deep-merge top-down; per-session wins. + +**Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. ### Data model @@ -170,26 +171,28 @@ One Alembic migration adds to `sessions`: | `retry_policy` | `JSONB NULL` | Full policy. | | `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | -`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters and joins; the rest live in JSONB. **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. +The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` remains non-unique and is used only for log/trace correlation. -### Decision and dispatch +`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. -The retry decision is added to the existing termination-event path. Two integration points are equivalent in correctness; the implementation PR will pick one: +### Decision and dispatch -- **Extend `SessionEventHandler`** in `event_dispatcher/handlers/session.py` with a `handle_session_failure` method (or fold the decision into `handle_session_terminated`), since failure metadata is already read there for endpoint-route bookkeeping. -- **Add a sokovan post-processor** under `sokovan/scheduler/post_processors/`, invoked when the scheduler observes a session entering a terminal failure state. +The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/session.py:52`), as a new `handle_session_failure` method on the existing class. 
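+
+In outline — a sketch mirroring the decision flow listed below; apart from `SessionEventHandler` itself, every name here is an assumption, and the real signature must match the neighboring handler methods:
+
+```python
+async def handle_session_failure(repo, emit, schedule_child, event) -> None:
+    session = await repo.get_session(event.session_id)  # parent, already terminal
+    policy = session.retry_policy or DEFAULT_POLICY     # merged with etcd default
+    if session.retry_count >= policy.max_retries:
+        await emit("session.retry_exhausted", session.id)
+        return
+    cause = classify_failure(session, session.status_data)
+    if cause in NEVER_RETRIABLE or cause not in policy.eligible_causes:
+        return
+    delay = compute_delay(policy, session.retry_count)  # backoff + jitter, capped
+    await schedule_child(session, delay)                # never sleep in the handler
+```
+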
Rationale: failure metadata (`session.status_data["error"]`) is already loaded there for endpoint-route bookkeeping, the handler runs after the session has reached a terminal status (so the parent state is settled), and adding logic here does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). A sokovan post-processor was considered but rejected for v1: it runs *during* scheduling iterations, which complicates idempotency and timing without adding capability the event-handler path lacks. -The decision flow is the same regardless: +The decision flow: -1. Load session. If `retry_count >= max_retries` → emit `session.retry_exhausted` and return. +1. Load the parent session. If `retry_count >= max_retries`, emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` inside the session repository's `begin_session()` transaction. If a child whose deterministic `creation_id = parent.creation_id + ":retry:" + (retry_count + 1)` already exists, return (idempotency). +3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. 4. Compute `delay` per the formula above. -5. Schedule retry creation through the existing background task / event mechanism with the computed delay. Do not block the handler on a sleep. +5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. +6. The child `INSERT` is the second idempotency boundary: the partial unique index on `(parent_session_id, retry_count)` rejects duplicate dispatches that bypass step 3 (e.g., handler crash + replay). + +The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent. -The retry path calls `SessionService.create_from_params()` with a `CreateFromParamsAction` derived from the parent (image, mounts, `resource_slots`, env, cluster spec, batch entrypoint). The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. +**Failure mode of the retry handler itself.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If `BackgroundTaskManager.start_retriable()` fails to enqueue, the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The handler must not raise out of `handle_session_failure`; an unhandled exception in an event handler can stall the dispatcher. -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. 
A computed `retry_state` field on the API tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine. +**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely. ### API surface @@ -222,6 +225,14 @@ No retry mutation in v1; manual retry is deferred until the auto path stabilizes - External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. - No breaking changes. +### Quota and accounting + +A retry attempt is a fresh `SessionRow` and counts against the user's concurrent-session limit while it is alive — same as if the user had re-submitted manually. The previous attempt's resource consumption is not refunded; this matches the principle that "actual GPU/CPU time was spent, regardless of why the session ended." The API exposes the chain so accounting tools can group attempts under one logical job if they choose. + +### Operational kill switch + +The cluster-level etcd default doubles as a kill switch: setting `config/manager/retry_policy_default` to `{max_retries: 0}` disables retries globally without redeploying the manager. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). + ## Implementation Plan Six PRs, each tracked by its own sub-issue under #11320: From 461d0e193a6a14bdacba2cd27da33ac564b01093 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:22:40 +0900 Subject: [PATCH 6/9] docs(BA-5851): guard BATCH-only and pin status_data error contract Address second-pass review: - Make the BATCH session-type guard explicit at step 1 of the decision flow. SessionEventHandler is shared across BATCH / INTERACTIVE / INFERENCE; without the guard, handle_session_failure would fire for INFERENCE failures too (the same handler already has INFERENCE-specific routing logic at line 210), violating the stated v1 scope. - Pin the status_data["error"] contract to manager/exceptions.py: convert_to_status_data and the ErrorStatusInfo/ErrorDetail TypedDicts (line 97). classify_failure reads error.name and error.src to map to a RetryEligibleCause; without this pin, the classifier would depend on an undocumented shape. Reviewer also confirmed no duplication with pre-existing features (RestartTracker, scheduler retriable_statuses, BEP-1049 deployment retry) -- all PARALLEL, none cover session-level retry of BATCH. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index 83dbff823c0..e4d923b3ff9 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -181,8 +181,8 @@ The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/se The decision flow: -1. Load the parent session. If `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +1. Load the parent session. 
**Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` (see line 210's INFERENCE-specific routing in `handle_session_terminated`) and are explicitly out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. +2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. 3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. 4. Compute `delay` per the formula above. 5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. From 7e65357ea00fd304b9b588fdefe3a10f9f345463 Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:28:58 +0900 Subject: [PATCH 7/9] docs(BA-5851): rework dispatch primitive and kill switch in BEP-1053 Three real bugs from third-pass review confirmed against the code: 1) BackgroundTaskManager.start_retriable (common/bgtask/bgtask.py:444) does NOT accept a delay parameter -- it fires immediately via asyncio.create_task. The "retriable" name refers to the task body retrying on failure, not delayed scheduling. The BEP picked this primitive based on its name alone. Replace with a durable session_retry_dispatch_queue table + periodic claim worker (outbox pattern). Survives manager restarts; idempotency matches the sessions-table partial unique index. 2) Adding a sibling handle_session_failure method on SessionEventHandler would have created a second handler on SessionFailureAnycastEvent (already subscribed by handle_batch_result at dispatch.py:520), racing against existing bookkeeping with undefined ordering. Fold the retry decision INTO handle_batch_result instead, in the SessionFailureAnycastEvent arm. 3) The "kill switch via cluster default" was contradictory: per-session policy wins on merge, so any user setting max_retries:N bypassed it. Split into two separate etcd keys: retry_policy_default (a default, merged) and retry_disabled (boolean, checked at the top of the decision flow before merge -- a true kill switch). Also: drop the creation_id retry-suffix idea entirely (creation_id is String(32) and could overflow). The partial unique index on (parent_session_id, retry_count) is the only idempotency boundary now; creation_id stays a per-attempt random token. Implementation Plan grew from 6 to 7 PRs (queue + dispatcher worker is its own PR); estimate bumped from 3-4 to 4-5 weeks. 
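For clarity, a minimal sketch of the intended check order for (3) — names are
illustrative stand-ins, not the final API; RetryPolicy here is a plain
placeholder for the BEP's Pydantic DTO:

    from dataclasses import dataclass

    @dataclass
    class RetryPolicy:                    # stand-in for the BEP's DTO
        max_retries: int = 0
        retry_delay: float = 60.0

    def resolve_retry(per_session: dict, cluster_default: dict,
                      retry_disabled: bool) -> RetryPolicy | None:
        # The kill switch is evaluated before any merge, so a per-session
        # max_retries > 0 can no longer bypass it.
        if retry_disabled:
            return None
        return RetryPolicy(**{**cluster_default, **per_session})  # per-session wins

    assert resolve_retry({"max_retries": 3}, {}, retry_disabled=True) is None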
Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 63 ++++++++++++++-------- 1 file changed, 42 insertions(+), 21 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index e4d923b3ff9..bc57f38dd66 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -49,7 +49,7 @@ POST /v2/sessions → repository → SessionRow (models/session/row.py:384) ``` -`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; we can extend it to also key retry attempts. +`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; today it is generated as `secrets.token_urlsafe(16)` (`services/session/service.py:1593`). It is **not** extended to encode retry chains — those use a separate first-class column (see Data Model). There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. @@ -148,14 +148,12 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) `MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. -### Defaults precedence +### Defaults and kill switch -Two layers in v1, matching Airflow's `default_args` spirit while staying compatible with parallel work on the config surface: +Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": -1. Per-session policy in the create request. -2. Cluster default in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). - -Effective policy = deep-merge top-down; per-session wins. +- **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. +- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Checked at the **top** of the decision flow, before any policy merge. When `true`, no retries are scheduled regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). **Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. @@ -171,26 +169,48 @@ One Alembic migration adds to `sessions`: | `retry_policy` | `JSONB NULL` | Full policy. | | `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | -The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. 
This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` remains non-unique and is used only for log/trace correlation. +The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` is unchanged and remains a per-attempt random token (no retry encoding). + +A second small table `session_retry_dispatch_queue` is added for durable delayed dispatch (see "Decision and dispatch"): + +| Column | Type | Description | +|---|---|---| +| `parent_session_id` | `UUID NOT NULL` | FK to `sessions.id`. | +| `retry_count` | `INT NOT NULL` | Target attempt number (= parent.retry_count + 1). | +| `scheduled_at` | `TIMESTAMPTZ NOT NULL` | Earliest dispatch time. | +| `claimed_at` | `TIMESTAMPTZ NULL` | Set when a dispatcher worker claims the row. | +| `dispatched_at` | `TIMESTAMPTZ NULL` | Set when the child session has been created. | + +Primary key `(parent_session_id, retry_count)` — the same constraint that protects `sessions` also serializes queue inserts. The queue lets retry decisions survive manager restarts, mirrors the durable-outbox pattern used by the existing pipeline orchestrator on top of Backend.AI, and avoids inventing in-memory scheduling. `parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. ### Decision and dispatch -The retry decision lives in `SessionEventHandler` (`event_dispatcher/handlers/session.py:52`), as a new `handle_session_failure` method on the existing class. Rationale: failure metadata (`session.status_data["error"]`) is already loaded there for endpoint-route bookkeeping, the handler runs after the session has reached a terminal status (so the parent state is settled), and adding logic here does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). A sokovan post-processor was considered but rejected for v1: it runs *during* scheduling iterations, which complicates idempotency and timing without adding capability the event-handler path lacks. +The retry decision is **folded into the existing `SessionEventHandler.handle_batch_result`** (`event_dispatcher/handlers/session.py:152`), not added as a sibling handler. Rationale: + +- `SessionFailureAnycastEvent` is **already** subscribed by `handle_batch_result` (`event_dispatcher/dispatch.py:520`). Adding a second handler on the same event would race against bookkeeping work (`set_session_result`, etc.) and depend on undefined dispatch ordering. 
+- Failure metadata (`session.status_data["error"]`) is already loaded in `handle_batch_result` for the existing failure path; the retry decision can reuse it without new DB roundtrips. +- The handler runs after the session has reached a terminal status, so parent state is settled, and the change does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). + +A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. -The decision flow: +**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (a sokovan-style worker — same cadence as existing periodic tasks) claims rows where `scheduled_at <= now() AND claimed_at IS NULL` via `UPDATE ... SET claimed_at = now() RETURNING ...` (atomic claim under PostgreSQL's row lock) and invokes `SessionService.create_from_params()`. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. -1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` (see line 210's INFERENCE-specific routing in `handle_session_terminated`) and are explicitly out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. +The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): + +1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. 2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count` to handle racing handlers on the same parent. +3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. 4. Compute `delay` per the formula above. -5. Hand off to `BackgroundTaskManager.start_retriable()` (already injected into `SessionService` at `services/session/service.py:245,408`) with the computed delay and a `CreateFromParamsAction` derived from the parent. The background task framework is already the canonical primitive for durable, replayable, delayed work in the manager — using it avoids inventing a new scheduling path. -6. 
The child `INSERT` is the second idempotency boundary: the partial unique index on `(parent_session_id, retry_count)` rejects duplicate dispatches that bypass step 3 (e.g., handler crash + replay).
+5. `INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique violation and the `INSERT` is skipped. Emit `session.retry_scheduled`.
+6. The dispatcher worker eventually claims the row, runs `SessionService.create_from_params` with a `CreateFromParamsAction` derived from the parent, and stamps `dispatched_at`. The child `INSERT` is the second idempotency boundary: the partial unique index on `sessions.(parent_session_id, retry_count)` rejects duplicate child rows even if two workers claim the same queue row through a PostgreSQL bug, replication lag, or operational error.
 
 The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent.
 
-**Failure mode of the retry handler itself.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If `BackgroundTaskManager.start_retriable()` fails to enqueue, the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The handler must not raise out of `handle_session_failure`; an unhandled exception in an event handler can stall the dispatcher.
+**Failure mode of the retry decision.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If the queue `INSERT` fails (DB unavailable), the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The decision must not raise out of `handle_batch_result`; an unhandled exception there would also break existing batch-result bookkeeping.
+
+**Failure mode of the dispatcher worker.** If `create_from_params` raises after the queue row is claimed, the worker stamps `dispatched_at` with a sentinel value and emits `session.retry_exhausted` carrying the underlying error. If the manager restarts while a row is claimed but not yet dispatched, the worker re-claims, on startup, any row whose `claimed_at` is older than a configurable lease (e.g., 5 minutes).
 
 **No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely.
 
 ### API surface
 
@@ -231,7 +251,7 @@ A retry attempt is a fresh `SessionRow` and counts against the user's concurrent
 
 ### Operational kill switch
 
-The cluster-level etcd default doubles as a kill switch: setting `config/manager/retry_policy_default` to `{max_retries: 0}` disables retries globally without redeploying the manager. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above).
+`config/manager/retry_disabled` (etcd, boolean) is the cluster-level kill switch — see "Defaults and kill switch" above.
Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). ## Implementation Plan @@ -239,12 +259,13 @@ Six PRs, each tracked by its own sub-issue under #11320: 1. **BEP draft** (this document) — #11321. 2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. -3. **Schema:** Alembic migration, `SessionRow` field expansion, repository read/write for the retry chain. -4. **Retry engine:** decision integration in the termination-event path, `SessionService.create_from_params` extension to inherit retry context, defaults precedence (project/domain/etcd), counters/events/audit. -5. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. -6. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. +3. **Schema:** Alembic migration adding `SessionRow` retry columns (with the partial unique index) and the `session_retry_dispatch_queue` table; repository read/write for the retry chain. +4. **Retry decision:** fold the decision into `handle_batch_result` in `SessionEventHandler`, queue insert with idempotency, etcd kill switch and cluster default, counters/events/audit. +5. **Dispatcher worker:** periodic claim loop on `session_retry_dispatch_queue`, `SessionService.create_from_params` extension to inherit retry context, lease-based recovery on manager restart. +6. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. +7. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. -Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism — ship with the retry-engine PR. Estimated effort: three to four weeks for one engineer. +Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism, manager-restart recovery of claimed-but-undispatched queue rows — ship with the dispatcher-worker PR. Estimated effort: four to five weeks for one engineer. ## References From 4dc297c47ebe1218a6e8b7fd0dcac75e8f476d1d Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 11:33:38 +0900 Subject: [PATCH 8/9] docs(BA-5851): close fourth-pass review gaps in BEP-1053 Three concrete implementation gaps from the latest hostile review: 1) Kill switch read pattern: changed from "read etcd at top of decision flow" (hot-path etcd read on every batch failure) to "loaded at startup, refreshed via existing EtcdConfigWatcher (config/provider.py: 20)." Also extends the kill-switch check to the dispatcher worker before claiming a queue row, so flipping the switch mid-incident halts in-flight queued retries. 2) Queue claim deadlock: pinned the SQL to single-row claim using "FOR UPDATE SKIP LOCKED LIMIT 1" inside an UPDATE-from-SELECT. Multiple manager replicas can now claim disjoint rows without contending on the same lock. Sentinel value for failed dispatch ('1970-01-01' timestamptz) made explicit. 3) classify_failure malformed-input fallback: explicitly does NOT default to UNKNOWN when status_data is missing or has missing required keys; instead returns a never-retriable sentinel and logs a WARNING. Only well-formed failures with unrecognized error.name map to UNKNOWN. Prevents retry storms from serialization bugs. 
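A condensed sketch of (3) — illustrative names only; the real enum and the
name-to-cause mapping live in the BEP body:

    import logging
    from enum import Enum

    log = logging.getLogger(__name__)

    class Cause(Enum):
        OOM_KILLED = "oom_killed"
        UNKNOWN = "unknown"
        NEVER_RETRIABLE = "never_retriable"   # hardcoded sentinel

    NAME_TO_CAUSE = {"OutOfMemoryError": Cause.OOM_KILLED}  # illustrative

    def classify_failure(session_id, status_data) -> Cause:
        error = (status_data or {}).get("error")
        if not isinstance(error, dict) or not {"name", "src"} <= error.keys():
            # Malformed envelope: permanent, never UNKNOWN, so a
            # serialization bug cannot fan out into a retry storm.
            log.warning("malformed status_data for session %s", session_id)
            return Cause.NEVER_RETRIABLE
        return NAME_TO_CAUSE.get(error["name"], Cause.UNKNOWN)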
Also: dispatcher worker now has a concrete proposed location (sokovan/scheduler/retry_dispatcher.py). Reviewer also confirmed (fourth pass) no duplication with existing queue/outbox patterns: SessionDependencyRow exists but is for kernel deps, not scheduling. Co-Authored-By: Claude Opus 4.7 (1M context) --- proposals/BEP-1053-native-session-retry.md | 24 +++++++++++++++++++--- 1 file changed, 21 insertions(+), 3 deletions(-) diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md index bc57f38dd66..d1f1cd1dddc 100644 --- a/proposals/BEP-1053-native-session-retry.md +++ b/proposals/BEP-1053-native-session-retry.md @@ -153,7 +153,7 @@ delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": - **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. -- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Checked at the **top** of the decision flow, before any policy merge. When `true`, no retries are scheduled regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). +- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Loaded at startup and refreshed via the existing `EtcdConfigWatcher` (`manager/config/provider.py:20`) so changes propagate without per-event etcd reads. Checked at the **top** of the decision flow, before any policy merge, **and** by the dispatcher worker before claiming a queue row — so flipping the switch mid-incident also halts in-flight queued retries. When `true`, no retries are scheduled or dispatched regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). **Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. @@ -195,12 +195,30 @@ The retry decision is **folded into the existing `SessionEventHandler.handle_bat A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. -**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (a sokovan-style worker — same cadence as existing periodic tasks) claims rows where `scheduled_at <= now() AND claimed_at IS NULL` via `UPDATE ... 
SET claimed_at = now() RETURNING ...` (atomic claim under PostgreSQL's row lock) and invokes `SessionService.create_from_params()`. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. +**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (placed under `sokovan/` alongside other periodic workers, e.g., `sokovan/scheduler/retry_dispatcher.py`) claims **one row at a time** via: + +```sql +UPDATE session_retry_dispatch_queue +SET claimed_at = now() +WHERE (parent_session_id, retry_count) = ( + SELECT parent_session_id, retry_count + FROM session_retry_dispatch_queue + WHERE scheduled_at <= now() + AND claimed_at IS NULL + AND dispatched_at IS NULL + ORDER BY scheduled_at + FOR UPDATE SKIP LOCKED + LIMIT 1 +) +RETURNING parent_session_id, retry_count; +``` + +`FOR UPDATE SKIP LOCKED` lets multiple manager replicas claim disjoint rows without contention, and `LIMIT 1` avoids multi-row claim deadlocks. The worker invokes `SessionService.create_from_params()` for the claimed row and stamps `dispatched_at = now()` on success or `dispatched_at = '1970-01-01'::timestamptz` (sentinel) on failure. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): 1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. +2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. **Malformed-input fallback:** if `status_data` is `None`, `status_data["error"]` is missing, or required keys (`name`, `src`) are missing, `classify_failure` does **not** return `UNKNOWN`. Instead it logs a WARNING and returns the hardcoded never-retriable sentinel — a malformed error envelope is treated as a permanent failure to avoid retry storms on serialization bugs. Only well-formed failures with an unrecognized `error.name` map to `UNKNOWN`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. 3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. 4. Compute `delay` per the formula above. 5. 
`INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique-violation and the `INSERT` is skipped. Emit `session.retry_scheduled`. From 7f679935f5748385d5cc8ab491acb858f24f728c Mon Sep 17 00:00:00 2001 From: Jeongseok Kang Date: Mon, 27 Apr 2026 13:37:49 +0900 Subject: [PATCH 9/9] docs(BA-5851): split BEP-1053 into two-tier batch resilience design Reviewer feedback (paraphrased): - "max_retries-style closed enum on the manager side breaks extensibility (seen this fail before with hardcoded runtime classification)." - "Most batch retry should live on the agent." - "Resource/node-level failures (OOM, disconnect) should be rescheduled to a different node, not retried in place; don't mutate resource allocation." The original BEP-1053 stacked all of this into a single per-session RetryPolicy + queue + child sessions. Pivot to two narrower BEPs that ship independently: BEP-1053 (re-scoped): "Agent-level Batch Retry" - batch_retries / batch_retry_delay knobs on session creation - agent re-runs the entrypoint inside the same kernel - no manager-side state, no new tables, no new events - ~100 lines, smallest possible delta on Agent.execute_batch BEP-1054 (new): "Session Rescheduling on Terminal Failure" - new RescheduleFailedBatchSessionsLifecycleHandler under sokovan - reuses phase_attempts (no new counter), SERVICE_MAX_RETRIES (now made configurable per scaling group, closes its FIXME) - extends the existing expired -> PENDING transition pattern to fire from terminal-failure with a node-level cause - failure classification is etcd pattern config (extensible), not a closed enum in code - same SessionRow, same allocation; no parent_session_id, no child sessions, no queue table The two BEPs compose: agent-side script retries first; if all attempts fail and the cause is node-level, the scheduler reschedules to a fresh node; on the new node, agent-side retries run again. Each attempt is recorded in scheduling history. Registry updated, news fragment rewritten. Pivot rationale captured at docs/investigation/bep-1053-design-pivot.md. 
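For intuition on how the two tiers compose (illustrative numbers, not
shipped defaults):

    batch_retries = 2    # BEP-1053: extra in-place runs per scheduled node
    node_attempts = 3    # BEP-1054: schedule attempts, capped by SERVICE_MAX_RETRIES
    print((1 + batch_retries) * node_attempts)   # 9 entrypoint runs, worst case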
Co-Authored-By: Claude Opus 4.7 (1M context) --- changes/11322.doc.md | 2 +- proposals/BEP-1053-agent-batch-retry.md | 131 ++++++++ proposals/BEP-1053-native-session-retry.md | 293 ------------------ ...ession-rescheduling-on-terminal-failure.md | 169 ++++++++++ proposals/README.md | 3 +- 5 files changed, 303 insertions(+), 295 deletions(-) create mode 100644 proposals/BEP-1053-agent-batch-retry.md delete mode 100644 proposals/BEP-1053-native-session-retry.md create mode 100644 proposals/BEP-1054-session-rescheduling-on-terminal-failure.md diff --git a/changes/11322.doc.md b/changes/11322.doc.md index 4c246d3f07e..8c3f2ced091 100644 --- a/changes/11322.doc.md +++ b/changes/11322.doc.md @@ -1 +1 @@ -Add BEP-1053 proposing native session-level retry for batch sessions, with a `RetryPolicy` schema modeled after Apache Airflow and adapted to Backend.AI's event-driven model +Add BEP-1053 (agent-level batch entrypoint retry) and BEP-1054 (session rescheduling on terminal failure) covering the two-tier batch resilience design — in-script retry stays on the agent; node-level failures reschedule the same session through the existing scheduler lifecycle handlers diff --git a/proposals/BEP-1053-agent-batch-retry.md b/proposals/BEP-1053-agent-batch-retry.md new file mode 100644 index 00000000000..91074f8620f --- /dev/null +++ b/proposals/BEP-1053-agent-batch-retry.md @@ -0,0 +1,131 @@ +--- +Author: Jeongseok Kang (jskang@lablup.com) +Status: Draft +Created: 2026-04-27 +Created-Version: 26.5.0 +Target-Version: +Implemented-Version: +--- + +# Agent-level Batch Retry + +## Related Issues + +- JIRA: BA-5851 +- GitHub Epic: #11320 +- GitHub: #11321 +- Companion BEP: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md) + +## Motivation + +When a `BATCH` session's entrypoint exits non-zero, the session is marked failed and the user must manually re-submit. Most batch failures in practice are transient (a flaky network call, a downstream service hiccup, an intermittent dependency error) and a simple in-place re-run would have succeeded. Today the user pays the cost of re-creating the session — re-scheduling, re-pulling the image, re-mounting volumes — for a problem that is purely inside the script. + +This BEP adds a small **agent-side** knob: re-run the batch entrypoint inside the same kernel up to N times before reporting failure. It is the simpler, smaller half of the batch-retry feature; the companion BEP-1054 covers the case where the failure is at the *node* level and a fresh schedule is needed. + +### Goals + +- Opt-in retry of the batch entrypoint inside an existing kernel. +- No new manager-side state, tables, or events. +- Default `batch_retries = 0` keeps current behavior. +- Per-session knob; no policy framework needed at this layer. + +### Non-goals + +- Failures before the kernel is running (image pull, scheduling). Those go to BEP-1054. +- OOM and node-level failures. Re-running on the same node typically does not help; BEP-1054 handles them by rescheduling. +- A user-supplied retry-policy DSL with backoff and classification. Out of scope for v1; if needed, accrue evidence first and design separately. + +## Current Design + +The agent runs batch entrypoints in `Agent.execute_batch()` (`src/ai/backend/agent/agent.py:2406`). The path: + +1. Kernel reaches the running state. +2. 
If `kernel_obj.session_type == SessionTypes.BATCH` (`agent.py:2274`), the agent enqueues `execute_batch(session_id, kernel_id, startup_command, batch_timeout)` into `_ongoing_exec_batch_tasks` (line 840).
+3. `execute_batch` invokes the kernel runner via `kernel.execute(...)` once.
+4. On a non-zero exit code (or timeout), the agent emits `SessionFailureAnycastEvent` and `SessionFailureBroadcastEvent` (lines 2375, 2389, 2464, 2478, 2492).
+5. On success, it emits `SessionSuccessAnycastEvent`/`SessionSuccessBroadcastEvent`.
+
+There is no in-script retry — the entrypoint runs exactly once per session. `RestartTracker` (line 757) handles *kernel* restart on agent crash recovery, not script re-execution.
+
+## Proposed Design
+
+### Knob
+
+Two new fields on the batch session creation request, plumbed through the existing kernel-config path that already carries `startup_command` and `batch_timeout`:
+
+| Field | Type | Default | Meaning |
+|---|---|---|---|
+| `batch_retries` | int (≥ 0) | `0` | Maximum number of additional `execute_batch` attempts after the first. Total attempts = `1 + batch_retries`. |
+| `batch_retry_delay` | float seconds (≥ 0) | `0.0` | Wait between attempts. Constant; no backoff at this layer. |
+
+The two fields sit alongside `startup_command`, `bootstrap_script`, and `batch_timeout` in the session creation DTO. They are batch-only — the agent ignores them when `session_type != SessionTypes.BATCH`.
+
+### Execution loop
+
+`execute_batch` becomes:
+
+```python
+async def execute_batch(self, session_id, kernel_id, startup_command, batch_timeout,
+                        batch_retries: int = 0, batch_retry_delay: float = 0.0):
+    last_exit_code: int | None = None
+    for attempt in range(batch_retries + 1):
+        if attempt > 0:
+            log.info("execute_batch(k:{}) retry attempt {}/{}", kernel_id, attempt, batch_retries)
+            await asyncio.sleep(batch_retry_delay)
+        last_exit_code = await self._run_batch_once(session_id, kernel_id, startup_command, batch_timeout)
+        if last_exit_code == 0:
+            await self._emit_session_success(session_id, kernel_id)
+            return
+        # else: non-zero exit -> retry if attempts remain
+    # exhausted
+    await self._emit_session_failure(session_id, kernel_id, last_exit_code)
+```
+
+Only **non-zero exit codes** trigger a retry. Cancellation, timeout, and infrastructure errors (kernel disconnect, container crash) do **not** loop here:
+- Cancellation propagates as today.
+- Timeout (`KernelLifecycleEventReason.TASK_TIMEOUT`, `agent.py:2492`) emits failure as today; rerunning a script that already ran past `batch_timeout` is unhelpful.
+- Container-level failures escalate to BEP-1054's domain.
+
+### Observability
+
+- `bai_agent_batch_retry_attempted_total{session_type=batch}` counter (per attempt beyond the first).
+- `bai_agent_batch_retry_succeeded_total` counter (incremented when a retry attempt exits zero).
+- `bai_agent_batch_retry_exhausted_total` counter (incremented when the loop ends with non-zero).
+- Each retry attempt logged at INFO with `(kernel_id, attempt, max_attempts)`.
+- The existing failure event is emitted only on final exhaustion; no new event types.
+
+### What does **not** change
+
+- Session lifecycle, statuses, or transitions.
+- Manager-side handlers (`SessionEventHandler`, sokovan).
+- Database schema.
+- `creation_id`, `parent_session_id` (does not exist), retry chain (does not exist).
+- API surface beyond the two new fields on the create request.
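+
+To make the loop semantics above concrete — only non-zero exits consume attempts, so `batch_retries = 2` allows at most three runs — here is a self-contained sketch of the same control flow (illustration only, not the agent's actual code):
+
+```python
+import asyncio
+
+async def run_with_batch_retries(run_once, batch_retries=0, batch_retry_delay=0.0):
+    # Mirrors the execute_batch loop: constant delay, retry only on non-zero exit.
+    last_exit_code = None
+    for attempt in range(batch_retries + 1):
+        if attempt > 0:
+            await asyncio.sleep(batch_retry_delay)
+        last_exit_code = await run_once()
+        if last_exit_code == 0:
+            return ("success", attempt + 1)    # attempts actually used
+    return ("failure", last_exit_code)
+
+async def main():
+    outcomes = iter([1, 1, 0])                 # fail, fail, then succeed
+    async def run_once():
+        return next(outcomes)
+    assert await run_with_batch_retries(run_once, batch_retries=2) == ("success", 3)
+
+asyncio.run(main())
+```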
+
+The only manager-side change is plumbing `batch_retries` and `batch_retry_delay` from the create request into the kernel config payload that the agent already receives.
+
+## Migration / Compatibility
+
+- Default `batch_retries = 0` preserves current behavior for every existing caller.
+- New fields are additive on the create request and on responses (echoed back for visibility).
+- No Alembic migration required.
+- Operators have a per-session opt-out by leaving the field unset; no global kill switch needed because the feature is opt-in.
+
+## Implementation Plan
+
+Three PRs:
+
+1. **BEP draft** (this document) plus the companion BEP-1054 — #11321.
+2. **Agent change:** extend `execute_batch` with the retry loop, plumb `batch_retries`/`batch_retry_delay` from kernel config, add metrics, unit tests around the loop semantics.
+3. **Client surface:** SDK v2 + CLI v2 accept the two new fields on `./bai session create -t batch`. REST v2 / GraphQL v2 echo them on session info responses.
+
+Tests live with the code under test. The agent's batch executor has existing test scaffolding; the loop is the smallest possible delta.
+
+Estimated effort: under one week for one engineer, given the constrained scope.
+
+## References
+
+- Companion: [BEP-1054 — Session Rescheduling on Terminal Failure](BEP-1054-session-rescheduling-on-terminal-failure.md)
+- Working draft of the prior single-BEP design and the pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
+- Apache Airflow's `retries` parameter (the inspirational reference): `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
+- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
diff --git a/proposals/BEP-1053-native-session-retry.md b/proposals/BEP-1053-native-session-retry.md
deleted file mode 100644
index d1f1cd1dddc..00000000000
--- a/proposals/BEP-1053-native-session-retry.md
+++ /dev/null
@@ -1,293 +0,0 @@
----
-Author: Jeongseok Kang (jskang@lablup.com)
-Status: Draft
-Created: 2026-04-27
-Created-Version: 26.5.0
-Target-Version:
-Implemented-Version:
----
-
-# Native Session Retry
-
-## Related Issues
-
-- JIRA: BA-5851
-- GitHub Epic: #11320
-- GitHub: #11321
-
-## Motivation
-
-Backend.AI core has no session-level retry. A `BATCH` session that fails — image pull error, transient agent failure, OOM, scheduler timeout, kernel non-zero exit — becomes terminal in `ERROR`, and the user must manually re-create it. The only retry-shaped logic in core today is infrastructure-level: DB transaction retry (`account_manager/models/utils.py:execute_with_txn_retry`), kernel restart on the agent (`agent.py:RestartTracker`), and `tenacity`-wrapped HTTP/socket retries. None of these handle "the session as a whole failed; create a fresh one with the same spec."
-
-The retry concern is therefore pushed to every higher-level orchestrator on top of Backend.AI, each of which re-implements the same logic with inconsistent semantics. Lifting retry into core gives one source of truth, resilience for plain batch workloads, and lets orchestrators thin out their own retry layers.
-
-### Goals
-
-- Opt-in automatic retry for `BATCH` sessions with a `RetryPolicy` accepted at session creation.
-- Each retry is a fresh session linked to its parent — no kernel reuse, no new status state.
-- Default `max_retries=0` keeps current behavior intact.
-- A single user-facing knob: setting `max_retries > 0` retries on any non-permanent failure.
- -## Current Design - -### Session lifecycle - -`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle: - -``` -PENDING → SCHEDULED → PREPARING → PULLING → PREPARED → CREATING → RUNNING → TERMINATING → TERMINATED -``` - -`terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out. `retriable_statuses()` (line 118) is unrelated to this BEP: it tells the scheduler which **startup** states are still safe to re-dispatch *within the same session*. This BEP introduces a separate concept — re-creating a fresh session after the previous one has gone terminal. - -### Session creation path - -``` -POST /v2/sessions - → CreateFromParamsAction - → SessionService.create_from_params (services/session/service.py:255) - → repository → SessionRow (models/session/row.py:384) -``` - -`SessionRow.creation_id` (lines 389–390) is a 32-character idempotency key reused across kernel placements; today it is generated as `secrets.token_urlsafe(16)` (`services/session/service.py:1593`). It is **not** extended to encode retry chains — those use a separate first-class column (see Data Model). - -There are no fields for `parent_session_id`, `retry_count`, `max_retries`, or a retry policy on `SessionRow`. - -### Termination event handling - -`SessionEventHandler` (`event_dispatcher/handlers/session.py:52`) already subscribes to the relevant events: - -| Method | Event | Line | -|---|---|---| -| `handle_session_started` | `SessionStartedAnycastEvent` | 88 | -| `handle_session_cancelled` | `SessionFailureAnycastEvent` | 105 | -| `handle_session_terminating` | `SessionTerminatingAnycastEvent` | 118 | -| `handle_session_terminated` | `SessionTerminatedAnycastEvent` | 130 | - -`handle_session_terminated` already consults `session.status_data["error"]` for endpoint-route bookkeeping, so the failure metadata needed for retry classification is already on hand at this point. What is missing is the decision: "should we spawn a child session?" - -No prior BEP covers session retry or fault tolerance. BEP-1030 (scheduler status transitions) covers in-session retries by the scheduler, not session re-creation. - -## Proposed Design - -### Mental model - -`max_retries > 0` means "retry on failure." Users should not need to opt in twice. Apache Airflow takes the same stance — any non-fatal exception triggers retry up to `retries`. Classification's job is only to exclude failures that semantically must not retry (cancellation, validation, quota), not to gate ordinary failure modes. - -### `RetryPolicy` schema - -A Pydantic DTO at `src/ai/backend/common/dto/manager/v2/session/retry_policy.py`, matching the v2 DTO location used by other recent BEPs. Per `src/ai/backend/manager/data/CLAUDE.md`, `data/` is reserved for frozen dataclasses with no framework deps; Pydantic models live under `common/dto/` so they can be shared across REST v2 and GraphQL. 
Schema modeled on Airflow's parameter surface: - -```python -class BackoffStrategy(StrEnum): - FIXED = "fixed" - EXPONENTIAL = "exponential" - -class JitterMode(StrEnum): - NONE = "none" - DETERMINISTIC = "deterministic" - RANDOM = "random" - -class RetryEligibleCause(StrEnum): - AGENT_TRANSIENT = "agent_transient" - SCHEDULER_TIMEOUT = "scheduler_timeout" - IMAGE_PULL_FAILURE = "image_pull_failure" - KERNEL_NONZERO_EXIT = "kernel_nonzero_exit" - OOM_KILLED = "oom_killed" - UNKNOWN = "unknown" - - @classmethod - def defaults(cls) -> frozenset["RetryEligibleCause"]: - return frozenset(cls) - -class RetryPolicy(BaseModel): - max_retries: NonNegativeInt = 0 - retry_delay: PositiveFloat = 60.0 - backoff: BackoffStrategy = BackoffStrategy.FIXED - backoff_multiplier: PositiveFloat = 2.0 - max_retry_delay: PositiveFloat | None = 3600.0 - jitter: JitterMode = JitterMode.DETERMINISTIC - jitter_ratio: confloat(ge=0, le=1) = 0.25 - eligible_causes: frozenset[RetryEligibleCause] = Field( - default_factory=RetryEligibleCause.defaults - ) - emit_retry_events: bool = True -``` - -Notable deviations from Airflow: - -- **No callback parameter.** Backend.AI is event-driven; downstream consumers subscribe to `session.retry_*` events instead of registering an `on_retry_callback`. Keeps the policy serializable and the server's behavior fully auditable. -- **Structural cause enum, not exception types.** Backend.AI does not surface user exceptions across the manager/agent boundary the way Airflow does intra-process; classification reads `status_data` instead. -- **`max_retries` is a count.** Total attempts = `1 + max_retries`, matching Backend.AI naming and the existing pipeline orchestrator on top of Backend.AI. - -### Failure classification - -A central `classify_failure(session, status_data) → RetryEligibleCause`. Hardcoded never-retriable causes live outside the enum: `USER_CANCELLED`, `VALIDATION_ERROR`, `QUOTA_EXCEEDED`. Users cannot opt these into retry. - -| Cause | In default eligible set | Notes | -|---|---|---| -| `AGENT_TRANSIENT` | yes | Lost heartbeat, agent restart mid-run. | -| `SCHEDULER_TIMEOUT` | yes | Kernel-creation timeout under cluster pressure. | -| `IMAGE_PULL_FAILURE` | yes | Typo wastes a few seconds with backoff; registry blip is real. | -| `KERNEL_NONZERO_EXIT` | yes | The most common reason batch users want retry. | -| `OOM_KILLED` | yes | Retry without resource bump usually fails again, but exhausting `max_retries` is cheap. | -| `UNKNOWN` | yes | Conservative for unclassified failures. | -| `USER_CANCELLED` | hardcoded never | Permanent. | -| `VALIDATION_ERROR` / `QUOTA_EXCEEDED` | hardcoded never | Permanent. | - -### Backoff formula - -``` -base = retry_delay if backoff == FIXED - min(retry_delay * backoff_multiplier ** retry_count, otherwise - max_retry_delay or MAX_RETRY_DELAY) -delay = apply_jitter(base, mode=jitter, ratio=jitter_ratio, - seed=(session_id, retry_count)) -delay = min(delay, max_retry_delay or MAX_RETRY_DELAY) -``` - -`MAX_RETRY_DELAY` is a hard 24 h ceiling, matching Airflow. Deterministic jitter is `SHA1(session_id || retry_count) mod (base * jitter_ratio)`, yielding reproducible delays — useful for tests. Random jitter samples uniformly in `[base, base * (1 + jitter_ratio))`. - -### Defaults and kill switch - -Two distinct concepts, kept separate to avoid the precedence trap of "default doubles as kill switch": - -- **Cluster default** in etcd: `config/manager/retry_policy_default` (ship default `max_retries=0` — no behavior change). 
This is a default; per-session policy wins on merge. Effective policy = deep-merge of cluster default and per-session policy. -- **Cluster kill switch** in etcd: `config/manager/retry_disabled` (boolean, default `false`). Loaded at startup and refreshed via the existing `EtcdConfigWatcher` (`manager/config/provider.py:20`) so changes propagate without per-event etcd reads. Checked at the **top** of the decision flow, before any policy merge, **and** by the dispatcher worker before claiming a queue row — so flipping the switch mid-incident also halts in-flight queued retries. When `true`, no retries are scheduled or dispatched regardless of per-session policy. Useful for incident response (e.g., disabling retries cluster-wide during a registry outage that would otherwise cause a retry storm). - -**Project / domain default is deferred.** [BEP-1052 (Scoped App Config Redesign)](BEP-1052-scoped-app-config-redesign.md) is concurrently rewriting the project / domain config surface around scoped `AppConfigFragment` rows. Adding `retry_policy_default` to the legacy project config row would conflict with that work. After BEP-1052 lands, a follow-up BEP can wire retry defaults into `AppConfigFragment` as a third precedence layer. - -### Data model - -One Alembic migration adds to `sessions`: - -| Column | Type | Description | -|---|---|---| -| `parent_session_id` | `UUID NULL` | Self-FK to `sessions.id`; null for the first attempt. | -| `retry_count` | `INT NOT NULL DEFAULT 0` | 0 for the first attempt. | -| `max_retries` | `INT NOT NULL DEFAULT 0` | Denormalized from policy for cheap filters. | -| `retry_policy` | `JSONB NULL` | Full policy. | -| `retry_cause` | `TEXT NULL` | Classified cause that triggered the most recent retry into this attempt. | - -The migration also adds a **partial unique index** on `(parent_session_id, retry_count) WHERE parent_session_id IS NOT NULL`. This is the actual idempotency guarantee for retry dispatch: even if two workers race past the parent row lock (different transactions, different timing), the second `INSERT` of a child with the same `(parent, attempt-number)` fails on the unique violation. `creation_id` is unchanged and remains a per-attempt random token (no retry encoding). - -A second small table `session_retry_dispatch_queue` is added for durable delayed dispatch (see "Decision and dispatch"): - -| Column | Type | Description | -|---|---|---| -| `parent_session_id` | `UUID NOT NULL` | FK to `sessions.id`. | -| `retry_count` | `INT NOT NULL` | Target attempt number (= parent.retry_count + 1). | -| `scheduled_at` | `TIMESTAMPTZ NOT NULL` | Earliest dispatch time. | -| `claimed_at` | `TIMESTAMPTZ NULL` | Set when a dispatcher worker claims the row. | -| `dispatched_at` | `TIMESTAMPTZ NULL` | Set when the child session has been created. | - -Primary key `(parent_session_id, retry_count)` — the same constraint that protects `sessions` also serializes queue inserts. The queue lets retry decisions survive manager restarts, mirrors the durable-outbox pattern used by the existing pipeline orchestrator on top of Backend.AI, and avoids inventing in-memory scheduling. - -`parent_session_id`, `retry_count`, and `max_retries` are first-class columns because they appear in filters, joins, and the unique index; the rest live in JSONB. `parent_session_id` is the canonical query for "show me the retry chain of this session." **No new history table** — the chain is already a linked list of real `SessionRow`s, each with its own status, kernels, logs, and `status_data`. 
The migration is idempotent and backportable per `src/ai/backend/manager/models/alembic/README.md`. - -### Decision and dispatch - -The retry decision is **folded into the existing `SessionEventHandler.handle_batch_result`** (`event_dispatcher/handlers/session.py:152`), not added as a sibling handler. Rationale: - -- `SessionFailureAnycastEvent` is **already** subscribed by `handle_batch_result` (`event_dispatcher/dispatch.py:520`). Adding a second handler on the same event would race against bookkeeping work (`set_session_result`, etc.) and depend on undefined dispatch ordering. -- Failure metadata (`session.status_data["error"]`) is already loaded in `handle_batch_result` for the existing failure path; the retry decision can reuse it without new DB roundtrips. -- The handler runs after the session has reached a terminal status, so parent state is settled, and the change does not interact with the recently refactored sokovan termination flow (#11250 — `mark_sessions_for_termination()` in `sokovan/scheduling_controller/scheduling_controller.py:266`). - -A sokovan post-processor was considered but rejected for v1: post-processors run *during* scheduling iterations, complicating idempotency and timing without adding capability the event-handler path lacks. - -**Dispatch primitive.** `BackgroundTaskManager.start_retriable()` (`common/bgtask/bgtask.py:444`) is **not** suitable: it accepts no `delay` parameter and fires the task immediately via `asyncio.create_task` ("retriable" refers to the task body retrying on failure, not delayed scheduling). Instead, retries are persisted to the new `session_retry_dispatch_queue` table (see Data Model). A periodic loop in the manager (placed under `sokovan/` alongside other periodic workers, e.g., `sokovan/scheduler/retry_dispatcher.py`) claims **one row at a time** via: - -```sql -UPDATE session_retry_dispatch_queue -SET claimed_at = now() -WHERE (parent_session_id, retry_count) = ( - SELECT parent_session_id, retry_count - FROM session_retry_dispatch_queue - WHERE scheduled_at <= now() - AND claimed_at IS NULL - AND dispatched_at IS NULL - ORDER BY scheduled_at - FOR UPDATE SKIP LOCKED - LIMIT 1 -) -RETURNING parent_session_id, retry_count; -``` - -`FOR UPDATE SKIP LOCKED` lets multiple manager replicas claim disjoint rows without contention, and `LIMIT 1` avoids multi-row claim deadlocks. The worker invokes `SessionService.create_from_params()` for the claimed row and stamps `dispatched_at = now()` on success or `dispatched_at = '1970-01-01'::timestamptz` (sentinel) on failure. This pattern is durable across manager restarts and matches the outbox approach used by the sibling pipeline orchestrator. - -The decision flow inside `handle_batch_result` (in the `SessionFailureAnycastEvent` arm): - -1. Load the parent session. **Short-circuit unless `session.session_type == SessionTypes.BATCH`** — interactive and inference sessions share `SessionEventHandler` and are out of scope for v1. Then, if `retry_count >= max_retries`, emit `session.retry_exhausted` and return. -2. Classify failure via `classify_failure(session, status_data)`. The shape of `status_data["error"]` is defined by `manager/exceptions.py:convert_to_status_data` (returns `ErrorStatusInfo` / `ErrorDetail` TypedDicts at line 97); classification reads `error.name` and `error.src` to map to a `RetryEligibleCause`. 
**Malformed-input fallback:** if `status_data` is `None`, `status_data["error"]` is missing, or required keys (`name`, `src`) are missing, `classify_failure` does **not** return `UNKNOWN`. Instead it logs a WARNING and returns the hardcoded never-retriable sentinel — a malformed error envelope is treated as a permanent failure to avoid retry storms on serialization bugs. Only well-formed failures with an unrecognized `error.name` map to `UNKNOWN`. If the cause is hardcoded never-retriable, or not in `policy.eligible_causes`, return. -3. Inside the session repository's `begin_session()` transaction, lock the parent row with `sa.select(SessionRow).where(SessionRow.id == parent.id).with_for_update()` and re-read `retry_count`. -4. Compute `delay` per the formula above. -5. `INSERT` into `session_retry_dispatch_queue` with `(parent_session_id, retry_count + 1, now() + delay)`. The PK on `(parent_session_id, retry_count)` makes this idempotent: a duplicate dispatch (handler replay, concurrent handlers) hits a unique-violation and the `INSERT` is skipped. Emit `session.retry_scheduled`. -6. The dispatcher worker eventually claims the row, runs `SessionService.create_from_params` with a `CreateFromParamsAction` derived from the parent, and stamps `dispatched_at`. The child `INSERT` is the second idempotency boundary: the partial unique index on `sessions.(parent_session_id, retry_count)` rejects duplicate child rows even if two workers claim the same queue row through PG bug, replication lag, or operational error. - -The child inherits `retry_policy`, sets `parent_session_id` to the parent, and `retry_count = parent.retry_count + 1`. The `CreateFromParamsAction` carries the same image, mounts, `resource_slots`, env, cluster spec, and batch entrypoint as the parent. - -**Failure mode of the retry decision.** If `classify_failure` raises, the session stays in its terminal state and the failure is logged at ERROR level — no retry, no crash propagation. If the queue `INSERT` fails (DB unavailable), the parent's `status_data` is annotated with the dispatch failure and `session.retry_exhausted` is emitted. The decision must not raise out of `handle_batch_result`; an unhandled exception there would also break existing batch-result bookkeeping. - -**Failure mode of the dispatcher worker.** If `create_from_params` raises after the queue row is claimed, the worker stamps `dispatched_at` to a sentinel value and emits `session.retry_exhausted` with the underlying error. Manager restart while a row is claimed-but-not-dispatched: the worker re-claims rows whose `claimed_at` is older than a configurable lease (e.g., 5 minutes) on startup. - -**No new `RETRYING` status.** The parent goes to `ERROR` as today; the child starts in `PENDING` as today. A computed `retry_state` (resolved at the API layer, not stored) tells clients "attempt N of M" or "this session has a pending child." This avoids touching the scheduler state machine entirely. - -### API surface - -REST v2 (`api/rest/v2/sessions/`): - -| Method | Path | Purpose | -|---|---|---| -| `POST` | `/sessions` | Accept optional `retry_policy` in `SessionCreateRequest`. | -| `GET` | `/sessions/{id}` | Return `parent_session_id`, `retry_count`, `max_retries`, `retry_policy`, `retry_cause`, plus computed `retry_chain` (oldest → newest IDs). | -| `GET` | `/sessions/{id}/attempts` | Return the chain with the status of each attempt. 
| - -GraphQL v2: mirror in `api/gql/session/types.py` — `parentSession`, `retryCount`, `maxRetries`, `retryPolicy`, `retryCause`, `retryChain` resolver. - -Client SDK v2 + CLI v2: expose the new fields; `./bai session info` shows `attempt N of M` and links to the parent. - -No retry mutation in v1; manual retry is deferred until the auto path stabilizes. - -### Observability - -- Counters: `bai_session_retry_scheduled_total{cause}`, `bai_session_retry_exhausted_total{cause}`, `bai_session_retry_succeeded_total`. -- Events: `session.retry_scheduled`, `session.retry_exhausted` — consumable by the webhook plugin, replacing the role of Airflow's `on_retry_callback` for downstream consumers. -- Audit log entry per retry dispatch: cause and attempt N of M. - -## Migration / Compatibility - -- Default `max_retries=0` keeps behavior unchanged for every existing caller. -- All new columns are nullable or default to safe zero values; the Alembic migration is purely additive. -- Existing GraphQL and REST clients continue to work; new fields are additive on responses. -- Operators opt in by setting the cluster default in etcd or a per-session policy. -- External orchestrators may continue using their own retry layers; migration to native retry is independent and incremental. -- No breaking changes. - -### Quota and accounting - -A retry attempt is a fresh `SessionRow` and counts against the user's concurrent-session limit while it is alive — same as if the user had re-submitted manually. The previous attempt's resource consumption is not refunded; this matches the principle that "actual GPU/CPU time was spent, regardless of why the session ended." The API exposes the chain so accounting tools can group attempts under one logical job if they choose. - -### Operational kill switch - -`config/manager/retry_disabled` (etcd, boolean) is the cluster-level kill switch — see "Defaults and kill switch" above. Per-project / per-user kill switches are deferred until the project-default layer lands (see [BEP-1052](BEP-1052-scoped-app-config-redesign.md) dependency above). - -## Implementation Plan - -Six PRs, each tracked by its own sub-issue under #11320: - -1. **BEP draft** (this document) — #11321. -2. **Foundation:** `RetryPolicy` DTO, `classify_failure`, backoff utility with deterministic jitter. Pure, no I/O, unit-test heavy. -3. **Schema:** Alembic migration adding `SessionRow` retry columns (with the partial unique index) and the `session_retry_dispatch_queue` table; repository read/write for the retry chain. -4. **Retry decision:** fold the decision into `handle_batch_result` in `SessionEventHandler`, queue insert with idempotency, etcd kill switch and cluster default, counters/events/audit. -5. **Dispatcher worker:** periodic claim loop on `session_retry_dispatch_queue`, `SessionService.create_from_params` extension to inherit retry context, lease-based recovery on manager restart. -6. **API surface:** REST v2 and GraphQL v2 fields, `attempts` endpoint. -7. **Client:** SDK v2, CLI v2 (`./bai session info` retry view), user docs. - -Tests live with the code under test. Cross-cutting integration tests — transient → retry → success, exhaustion, concurrent dispatch idempotency, jitter determinism, manager-restart recovery of claimed-but-undispatched queue rows — ship with the dispatcher-worker PR. Estimated effort: four to five weeks for one engineer. 
## References

- Working draft: `docs/investigation/native-session-retry-plan.md`
- Apache Airflow retry implementation: `airflow-core/src/airflow/models/taskinstance.py:1109-1159`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
- Alembic backport strategy: `src/ai/backend/manager/models/alembic/README.md`

diff --git a/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md
new file mode 100644
index 00000000000..44a07e5d60b
--- /dev/null
+++ b/proposals/BEP-1054-session-rescheduling-on-terminal-failure.md
@@ -0,0 +1,169 @@
---
Author: Jeongseok Kang (jskang@lablup.com)
Status: Draft
Created: 2026-04-27
Created-Version: 26.5.0
Target-Version:
Implemented-Version:
---

# Session Rescheduling on Terminal Failure

## Related Issues

- JIRA: BA-5851
- GitHub Epic: #11320
- GitHub: #11321
- Companion BEP: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)

## Motivation

Some session failures are **node-level**: the kernel was OOM-killed on this host, the agent disconnected mid-run, the registry route used by this scaling group is briefly down, the network namespace setup failed for a node-specific reason. For these cases, re-running the script in place — Backend.AI's existing scheduler-internal retries, or BEP-1053's agent-level batch retry — does not help. What does help is **rescheduling the same session to a different node**, with the same resource allocation.

Today, terminal-failure sessions stay terminal. There is no path that takes a session in `ERROR` and pushes it back through the scheduler. Operators have to ask users to re-create their sessions, often after diagnosing that the failure was the host's fault, not the user's. This BEP closes that gap.

It is the companion to [BEP-1053](BEP-1053-agent-batch-retry.md), which handles in-script retry; together they cover the two distinct retry surfaces. They are designed to ship independently.

### Goals

- Re-dispatch a terminal-failed `BATCH` session through the scheduler when the failure is classified as **node-level**.
- Reuse existing scheduler infrastructure: `SessionLifecycleHandler`, `phase_attempts`, scheduling history, the `expired → PENDING` transition pattern.
- Make failure classification **operator-extensible** — etcd-driven pattern config, not a closed enum in code.
- Promote the standing `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`) to a real configuration knob as a side effect.
- Default off; opt-in per scaling group.

### Non-goals

- Mutating resource allocation (no "give it more memory and retry"). Resource decisions stay with the user/admin.
- User-facing per-session `RetryPolicy` with backoff/jitter/max. Rescheduling is operator-policy, not user-policy.
- Interactive or inference sessions. INTERACTIVE is user-driven; INFERENCE has BEP-1049 deployment-route handling.
- Re-running the user script in place. That is BEP-1053's job.

## Current Design

### Session lifecycle and terminal status

`SessionStatus` (`src/ai/backend/manager/data/session/types.py:30-50`) defines the lifecycle. `terminal_statuses()` (line 109) is `{ERROR, TERMINATED, CANCELLED}` — no transitions out today. `retriable_statuses()` (line 118) is the scheduler's *in-session* retriable set; it does not apply to sessions already in `ERROR`.
### Sokovan lifecycle handlers

Periodic `SessionLifecycleHandler`s drive scheduler decisions (`sokovan/scheduler/handlers/`). Each declares `success / need_retry / expired / give_up` outcomes and the status transitions for each (`base.py:62-93`). Existing handlers include `CheckPreconditionLifecycleHandler` and `StartSessionsLifecycleHandler`, which use the **`expired → PENDING`** transition pattern (`check_precondition.py:67`, `start_sessions.py:78`) — the canonical "re-schedule this session" mechanism, scoped today to startup-stage timeouts.

### Existing counters and caps

- `phase_attempts` (`sokovan/data/lifecycle.py:322`): per-session attempt counter sourced from scheduling history (`coordinator.py:756`). Documented as "give_up when >= max_retries."
- `SERVICE_MAX_RETRIES = 5 # FIXME: make configurable` (`manager/defs.py:121`): the global cap, used by both the session and deployment coordinators (`coordinator.py:1228`, `deployment/coordinator.py:764`).

### Failure metadata

When a session fails, `SessionRow.status_data` carries `{"error": {"name": ..., "src": ...}}` per `manager/exceptions.py:convert_to_status_data` and the `ErrorStatusInfo` / `ErrorDetail` TypedDicts (line 97). The shape is stable.

### What is missing

A handler that fires on **terminal-failure** sessions, classifies the failure, and either reschedules or accepts the failure. Today's handlers run on non-terminal sessions only.

## Proposed Design

### A new lifecycle handler: `RescheduleFailedBatchSessionsLifecycleHandler`

Lives at `sokovan/scheduler/handlers/lifecycle/reschedule_failed_batch.py`, alongside the existing handlers. Targets sessions where:

- `session_type == SessionTypes.BATCH`
- `status == ERROR`
- `phase_attempts < effective_max_retries`
- `status_data["error"]` classifies as a *reschedulable* cause (see "Failure classification" below).

Outcomes:

- **`success`** (rescheduling fired): transition `ERROR → PENDING`. Re-uses the existing `expired → PENDING` machinery, just from a new starting status. Increments `phase_attempts` via the standard scheduling-history append.
- **`give_up`** (cap reached, or cause not reschedulable): no transition. The session stays in `ERROR`.
- **`need_retry`** (transient inability to act, e.g., DB contention): no transition; the handler retries next cycle.

The handler reuses **everything** the existing lifecycle handlers reuse: `phase_attempts` from scheduling history is the counter, `SERVICE_MAX_RETRIES` (now configurable, see below) is the cap, and the lifecycle-coordinator path applies the transition. No new column on `SessionRow`. No queue table. No child sessions.

### Same session, not a child

A reschedule keeps the original `SessionRow` — same `id`, same `creation_id`, same kernel records, same resource allocation. The session re-enters `PENDING` with `phase_attempts` incremented; the scheduler picks a new agent on the next dispatch cycle. The kernels associated with the previous attempt are cleaned up as part of the terminal-state transition that already runs today.

This is intentionally different from the original BEP-1053 draft: there are no parent-child rows, no retry chain, no `parent_session_id`. The "history" of attempts is what scheduling history already records.
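Putting the targeting rules and outcomes together, a minimal sketch of the per-session decision follows. The handler-class wiring is omitted, the names `decide_reschedule` and `Decision` are hypothetical, `phase_attempts` is assumed to have been resolved onto the session view by the coordinator, and `SessionTypes` / `SessionStatus` are the enums cited in "Current Design". The classifier it consults is specified in the next subsection; the transient `need_retry` path is left out for brevity.

```python
from enum import StrEnum

class Decision(StrEnum):   # hypothetical helper, not the real outcome type
    SUCCESS = "success"    # fire the ERROR -> PENDING transition
    GIVE_UP = "give_up"    # leave the session in ERROR
    SKIP = "skip"          # not a target of this handler at all

def decide_reschedule(session, classifier, effective_max_retries: int) -> Decision:
    # Target only terminal-failed batch sessions.
    if session.session_type != SessionTypes.BATCH or session.status != SessionStatus.ERROR:
        return Decision.SKIP
    # The cap check reuses phase_attempts sourced from scheduling history.
    if session.phase_attempts >= effective_max_retries:
        return Decision.GIVE_UP
    error = (session.status_data or {}).get("error") or {}
    action = classifier.classify(error)  # pattern config, next subsection
    return Decision.SUCCESS if action == "reschedule" else Decision.GIVE_UP
```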
### Failure classification — extensible, not closed

A closed enum of causes hardcodes runtime behavior into code; site-specific failure signatures (vendor accelerator faults, registry-specific image-pull errors, custom-plugin failures) cannot be classified without a manager release. Replace the closed enum with a **pattern-based config**, loaded from etcd and refreshed via `EtcdConfigWatcher` (`manager/config/provider.py:20`):

```yaml
# config/manager/session_failure_classification
default: give_up
by_error_name:
  OOMError: reschedule
  AgentDisconnected: reschedule
  ImagePullError: give_up # agent's tenacity already retried
  HeartbeatTimeout: reschedule
  ValidationError: give_up
  QuotaExceededError: give_up
by_error_src:
  agent: reschedule # fallback for agent-side errors not named above
```

Resolution order: `by_error_name` (most specific) → `by_error_src` → `default`. The result is one of three closed `Action` values: `reschedule`, `give_up`, or `ignore` (do not handle yet — leave the session for a later cycle; rarely used). A resolver sketch follows the Backward compatibility notes below.

The **action catalog** stays a closed enum (the manager has to know what each action means), but the **cause catalog** is open: operators add patterns without code changes.

Hardcoded never-reschedulable causes: `USER_CANCELLED` (user intent), and any cause that originates *after* the session reached `RUNNING` and the user's script started — those are BEP-1053's domain. The handler short-circuits on these regardless of config.

### `SERVICE_MAX_RETRIES` becomes configurable

A new key under the same etcd config tree: `config/manager/scheduler_max_retries`. Read at startup, refreshed via `EtcdConfigWatcher`. Per-scaling-group overrides live under `config/scaling-groups/{sg_name}/scheduler_max_retries`. The default is `5` (matching the current constant). The handler resolves the cap from scaling-group config first, then cluster config, then the default. This closes the standing `FIXME: make configurable`.

### Kill switch

`config/manager/reschedule_disabled` (etcd boolean, default `false`). Loaded at startup, watched. Checked at the top of the handler's per-cycle execution; when `true`, the handler is a no-op for that cycle. Useful for incident response (e.g., stop rescheduling cluster-wide during a cascade).

### Observability

- Counters: `bai_session_reschedule_attempted_total{cause}`, `bai_session_reschedule_capped_total{cause}` (cap reached), `bai_session_reschedule_succeeded_total` (a subsequent attempt reached `RUNNING`).
- Event: `session.rescheduled`, emitted when the `ERROR → PENDING` transition fires. Reuses the existing event-publication path from the lifecycle coordinator.
- Audit log entry per reschedule: `(session_id, cause, attempt N of M, source_agent, target_after = scheduler_choice)` (the new agent is whichever node the scheduler picks on the next dispatch cycle).
- The existing scheduling-history rows already record per-attempt timestamps and outcomes; that is the durable trail.

## Migration / Compatibility

### Backward compatibility

- The default classification config produces no `reschedule` actions for any cause, so even with the default `reschedule_disabled = false`, **the feature is effectively off until an operator populates the classification config** — zero behavior change on rollout.
- All etcd keys are additive; no existing key changes shape.
- No Alembic migration is required.
- The `SERVICE_MAX_RETRIES` constant in `manager/defs.py:121` remains as the default when the etcd key is absent. The `FIXME` is closed; the constant becomes a fallback.
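As promised above, a minimal sketch of the pattern-based resolver, assuming the config dict has already been fetched from the etcd key and parsed from YAML. The class name `FailureClassifier` matches the PR-2 foundation item in the Implementation Plan, but the method surface shown here is illustrative:

```python
from enum import StrEnum

class Action(StrEnum):
    RESCHEDULE = "reschedule"
    GIVE_UP = "give_up"
    IGNORE = "ignore"  # leave the session for a later cycle

class FailureClassifier:
    """Resolution order: by_error_name -> by_error_src -> default."""

    def __init__(self, config: dict) -> None:
        self._by_name = config.get("by_error_name") or {}
        self._by_src = config.get("by_error_src") or {}
        self._default = Action(config.get("default", "give_up"))

    def classify(self, error: dict) -> Action:
        # `error` is the status_data["error"] envelope: {"name": ..., "src": ...}.
        name, src = error.get("name"), error.get("src")
        if name in self._by_name:
            return Action(self._by_name[name])
        if src in self._by_src:
            return Action(self._by_src[src])
        return self._default

# With the YAML example above, an agent-side fault not named explicitly still
# falls through to the by_error_src rule:
#   classifier.classify({"name": "VendorXpuFault", "src": "agent"})
#   -> Action.RESCHEDULE
```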
### Quota and accounting

A reschedule does not create a new `SessionRow`, so concurrent-session limits are unaffected. Resource consumption from the previous attempt is not refunded — the user *did* consume those resources on the failed node — but the next attempt re-uses the same allocation request, so quota is not double-counted.

### Interaction with BEP-1053

The two BEPs are designed to compose:

- **BEP-1053** runs first, inside the failing kernel: on a non-zero exit it re-runs the script, and only if all attempts fail does the agent emit `SessionFailureAnycastEvent`.
- **BEP-1054** then evaluates the resulting terminal-failure session. If the cause is node-level, the scheduler reschedules. If the cause is "user script failed after all in-place retries," the classification config maps it to `give_up` and the session stays terminal.

A session can therefore experience: agent-side script retries → manager-side reschedule → agent-side script retries again on a new node. Each attempt is recorded in scheduling history; users see one logical job, operators see the full trail.

## Implementation Plan

The BEP draft plus five implementation PRs, each tracked under #11320:

1. **BEP draft** (this document and the companion BEP-1053) — #11321.
2. **Foundation:** the `FailureClassifier` (pattern-based, etcd-driven, refreshed via `EtcdConfigWatcher`; see the sketch in the classification section above) and the `Action` enum. Pure logic, unit-test heavy.
3. **`SERVICE_MAX_RETRIES` configurability:** etcd source + per-scaling-group override + fallback to the `defs.py` constant. Closes the standing FIXME.
4. **Lifecycle handler:** `RescheduleFailedBatchSessionsLifecycleHandler`, the kill switch, the `ERROR → PENDING` transition (extending the existing pattern to a new starting status), counters/events/audit.
5. **API surface:** session info responses include `reschedule_count` (a view over `phase_attempts`) and the latest `reschedule_cause`. No mutation; this is read-only observability.
6. **Client:** SDK v2 + CLI v2 surface the new info fields; user docs.

Tests live with the code under test. Cross-cutting integration tests — node-level failure → reschedule → success on a different agent; cap reached → terminal; classification config empty → terminal; kill switch on → no rescheduling — ship with the lifecycle-handler PR. Estimated effort: two to three weeks for one engineer.

## References

- Companion: [BEP-1053 — Agent-level Batch Retry](BEP-1053-agent-batch-retry.md)
- Working draft and design pivot rationale: `docs/investigation/bep-1053-design-pivot.md`
- [BEP-1030: Sokovan Scheduler Status Transition Design](BEP-1030-sokovan-scheduler-status-transition.md)
- [BEP-1049: Zero-Downtime Deployment Strategy Architecture](BEP-1049-deployment-strategy-handler.md) — analogous handler pattern for routes

diff --git a/proposals/README.md b/proposals/README.md
index f93b0b63f5b..590085e2753 100644
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -123,7 +123,8 @@ BEP numbers start from 1000.
 | [1050](BEP-1050-prometheus-query-preset-system.md) | Prometheus Query Preset System | BoKeum Kim | Draft |
 | [1051](BEP-1051-kata-containers-agent.md) | Kata Containers Agent Backend | Kyujin Cho | Draft |
 | [1052](BEP-1052-scoped-app-config-redesign.md) | Scoped App Config Redesign | Gyubong Lee | Draft |
-| [1053](BEP-1053-native-session-retry.md) | Native Session Retry | Jeongseok Kang | Draft |
+| [1053](BEP-1053-agent-batch-retry.md) | Agent-level Batch Retry | Jeongseok Kang | Draft |
+| [1054](BEP-1054-session-rescheduling-on-terminal-failure.md) | Session Rescheduling on Terminal Failure | Jeongseok Kang | Draft |
 | _next_ | _(reserve your number here)_ | | |

 ## File Structure