docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling) by rapsealk · Pull Request #11322 · lablup/backend.ai

rapsealk · 2026-04-27T01:18:06Z

Summary

Adds BEP-1053, capturing the design for native session-level retry in Backend.AI core.

RetryPolicy Pydantic schema modeled on Apache Airflow's parameter surface, with deviations documented.
Failure classification with structural RetryEligibleCause enum and hardcoded never-retry causes.
Decision/dispatch flow via a new event handler with idempotent child creation_id.
Linked-list retry chain on SessionRow (no new history table, no new status).
REST v2 / GraphQL v2 / SDK v2 / CLI v2 surface plan.
Migration is additive, backportable per the Alembic README, and default max_retries=0 preserves current behavior.

Implementation is broken into the remaining sub-issues under the epic.

Refs: #11320 (epic), #11321 (BA-5851) (this sub-issue).

Test plan

BEP rendered correctly on GitHub.
Registry entry resolves (BEP-1053-native-session-retry.md link).
No conflicting BEP number with concurrently merged proposals.
Reviewers confirm the schema, classification table, and decision log capture the agreed design.

🤖 Generated with Claude Code

Captures the design for adding native session-level retry to Backend.AI core, modeled after Apache Airflow's RetryPolicy parameter surface and adapted to Backend.AI's event-driven architecture. Subsequent implementation work tracked under epic. Refs: #11320, #11321 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new BEP proposal (BEP-1053) documenting the design for native session-level retry in Backend.AI core, and registers the BEP number in the proposals index.

Changes:

Reserve BEP number 1053 in the proposals registry.
Add proposals/BEP-1053-native-session-retry.md describing retry policy schema, failure classification, dispatch flow, data model, API surface, and rollout plan.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File	Description
proposals/README.md	Adds BEP-1053 entry to the BEP registry table.
proposals/BEP-1053-native-session-retry.md	Introduces the BEP draft for native session retry design and planned surface/migration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Verified code paths against latest main: - SessionStatus enum at data/session/types.py:30-50; terminal_statuses() at line 109 - SessionEventHandler at event_dispatcher/handlers/session.py:52, with handlers for started/cancelled/terminating/terminated; status_data["error"] already consulted in handle_session_terminated - SessionService.create_from_params at services/session/service.py:255 - SessionRow.creation_id at models/session/row.py:389-390 Drop redundant tables (Decision Log, Airflow-mapping) and the Open Questions section to match the format used by recent BEPs (BEP-1049, BEP-1050). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the Django-style select_for_update() reference with SQLAlchemy 2.x syntax matching the existing pattern in repositories/agent/db_source.py and repositories/deployment/db_source.py: sa.select(SessionRow).where(...) .with_for_update() inside begin_session(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address production-readiness review findings: - Add partial unique index on (parent_session_id, retry_count) as the real idempotency guarantee for retry dispatch. The parent row lock alone is insufficient because creation_id has unique=False (models/session/row.py:390); under concurrent handlers the second INSERT would have succeeded silently. The unique index makes duplicate child creation a hard failure. - Commit to extending SessionEventHandler in event_dispatcher/handlers/ session.py for the retry decision, instead of leaving "two equivalent integration points." Sokovan post-processors run during scheduling iterations, which complicates idempotency without adding capability. Also notes the recent sokovan refactor (#11250 / 8321c79) so future readers see why the post-processor path was rejected. - Specifically name BackgroundTaskManager.start_retriable() (already injected into SessionService at service.py:245,408) as the dispatch primitive instead of vague "background task / event mechanism." - Defer project/domain default layer to a follow-up after BEP-1052 (Scoped App Config Redesign) lands, so this BEP doesn't conflict with in-flight config-surface work. - Document the retry handler's own failure modes (classify_failure raise, start_retriable enqueue failure) and the accounting policy (each retry counts against quota; no refund). - Clarify retriable_statuses() is unrelated (in-session re-dispatch, not session re-creation), point DTO location reference to the manager data/CLAUDE.md rule, mark retry_state as an API-layer resolver. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Address second-pass review: - Make the BATCH session-type guard explicit at step 1 of the decision flow. SessionEventHandler is shared across BATCH / INTERACTIVE / INFERENCE; without the guard, handle_session_failure would fire for INFERENCE failures too (the same handler already has INFERENCE-specific routing logic at line 210), violating the stated v1 scope. - Pin the status_data["error"] contract to manager/exceptions.py: convert_to_status_data and the ErrorStatusInfo/ErrorDetail TypedDicts (line 97). classify_failure reads error.name and error.src to map to a RetryEligibleCause; without this pin, the classifier would depend on an undocumented shape. Reviewer also confirmed no duplication with pre-existing features (RestartTracker, scheduler retriable_statuses, BEP-1049 deployment retry) -- all PARALLEL, none cover session-level retry of BATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three real bugs from third-pass review confirmed against the code: 1) BackgroundTaskManager.start_retriable (common/bgtask/bgtask.py:444) does NOT accept a delay parameter -- it fires immediately via asyncio.create_task. The "retriable" name refers to the task body retrying on failure, not delayed scheduling. The BEP picked this primitive based on its name alone. Replace with a durable session_retry_dispatch_queue table + periodic claim worker (outbox pattern). Survives manager restarts; idempotency matches the sessions-table partial unique index. 2) Adding a sibling handle_session_failure method on SessionEventHandler would have created a second handler on SessionFailureAnycastEvent (already subscribed by handle_batch_result at dispatch.py:520), racing against existing bookkeeping with undefined ordering. Fold the retry decision INTO handle_batch_result instead, in the SessionFailureAnycastEvent arm. 3) The "kill switch via cluster default" was contradictory: per-session policy wins on merge, so any user setting max_retries:N bypassed it. Split into two separate etcd keys: retry_policy_default (a default, merged) and retry_disabled (boolean, checked at the top of the decision flow before merge -- a true kill switch). Also: drop the creation_id retry-suffix idea entirely (creation_id is String(32) and could overflow). The partial unique index on (parent_session_id, retry_count) is the only idempotency boundary now; creation_id stays a per-attempt random token. Implementation Plan grew from 6 to 7 PRs (queue + dispatcher worker is its own PR); estimate bumped from 3-4 to 4-5 weeks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three concrete implementation gaps from the latest hostile review: 1) Kill switch read pattern: changed from "read etcd at top of decision flow" (hot-path etcd read on every batch failure) to "loaded at startup, refreshed via existing EtcdConfigWatcher (config/provider.py: 20)." Also extends the kill-switch check to the dispatcher worker before claiming a queue row, so flipping the switch mid-incident halts in-flight queued retries. 2) Queue claim deadlock: pinned the SQL to single-row claim using "FOR UPDATE SKIP LOCKED LIMIT 1" inside an UPDATE-from-SELECT. Multiple manager replicas can now claim disjoint rows without contending on the same lock. Sentinel value for failed dispatch ('1970-01-01' timestamptz) made explicit. 3) classify_failure malformed-input fallback: explicitly does NOT default to UNKNOWN when status_data is missing or has missing required keys; instead returns a never-retriable sentinel and logs a WARNING. Only well-formed failures with unrecognized error.name map to UNKNOWN. Prevents retry storms from serialization bugs. Also: dispatcher worker now has a concrete proposed location (sokovan/scheduler/retry_dispatcher.py). Reviewer also confirmed (fourth pass) no duplication with existing queue/outbox patterns: SessionDependencyRow exists but is for kernel deps, not scheduling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reviewer feedback (paraphrased): - "max_retries-style closed enum on the manager side breaks extensibility (seen this fail before with hardcoded runtime classification)." - "Most batch retry should live on the agent." - "Resource/node-level failures (OOM, disconnect) should be rescheduled to a different node, not retried in place; don't mutate resource allocation." The original BEP-1053 stacked all of this into a single per-session RetryPolicy + queue + child sessions. Pivot to two narrower BEPs that ship independently: BEP-1053 (re-scoped): "Agent-level Batch Retry" - batch_retries / batch_retry_delay knobs on session creation - agent re-runs the entrypoint inside the same kernel - no manager-side state, no new tables, no new events - ~100 lines, smallest possible delta on Agent.execute_batch BEP-1054 (new): "Session Rescheduling on Terminal Failure" - new RescheduleFailedBatchSessionsLifecycleHandler under sokovan - reuses phase_attempts (no new counter), SERVICE_MAX_RETRIES (now made configurable per scaling group, closes its FIXME) - extends the existing expired -> PENDING transition pattern to fire from terminal-failure with a node-level cause - failure classification is etcd pattern config (extensible), not a closed enum in code - same SessionRow, same allocation; no parent_session_id, no child sessions, no queue table The two BEPs compose: agent-side script retries first; if all attempts fail and the cause is node-level, the scheduler reschedules to a fresh node; on the new node, agent-side retries run again. Each attempt is recorded in scheduling history. Registry updated, news fragment rewritten. Pivot rationale captured at docs/investigation/bep-1053-design-pivot.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings April 27, 2026 01:18

rapsealk added this to the 26.5 milestone Apr 27, 2026

github-actions Bot assigned rapsealk Apr 27, 2026

github-actions Bot added the size:L 100~500 LoC label Apr 27, 2026

Copilot started reviewing on behalf of rapsealk April 27, 2026 01:18 View session

rapsealk linked an issue Apr 27, 2026 that may be closed by this pull request

Author BEP for native session retry #11321

Open

rapsealk changed the title ~~docs: BEP-1053 native session retry~~ docs(BA-5851): Add BEP-1053 for native session retry Apr 27, 2026

docs(BA-5851): add news fragment for BEP-1053

ed3c61a

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI reviewed Apr 27, 2026

View reviewed changes

rapsealk and others added 6 commits April 27, 2026 11:10

rapsealk requested a review from HyeockJinKim April 27, 2026 03:16

rapsealk changed the title ~~docs(BA-5851): Add BEP-1053 for native session retry~~ docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling) Apr 27, 2026

rapsealk mentioned this pull request Apr 27, 2026

Author BEP for native session retry #11321

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322

docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322
rapsealk wants to merge 9 commits intomainfrom
bep/1053-native-session-retry

rapsealk commented Apr 27, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

rapsealk commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

rapsealk commented Apr 27, 2026 •

edited

Loading