docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322
Open
docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322
Conversation
Captures the design for adding native session-level retry to Backend.AI core, modeled after Apache Airflow's RetryPolicy parameter surface and adapted to Backend.AI's event-driven architecture. Subsequent implementation work tracked under epic. Refs: #11320, #11321 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Adds a new BEP proposal (BEP-1053) documenting the design for native session-level retry in Backend.AI core, and registers the BEP number in the proposals index.
Changes:
- Reserve BEP number 1053 in the proposals registry.
- Add
proposals/BEP-1053-native-session-retry.mddescribing retry policy schema, failure classification, dispatch flow, data model, API surface, and rollout plan.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| proposals/README.md | Adds BEP-1053 entry to the BEP registry table. |
| proposals/BEP-1053-native-session-retry.md | Introduces the BEP draft for native session retry design and planned surface/migration. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Verified code paths against latest main: - SessionStatus enum at data/session/types.py:30-50; terminal_statuses() at line 109 - SessionEventHandler at event_dispatcher/handlers/session.py:52, with handlers for started/cancelled/terminating/terminated; status_data["error"] already consulted in handle_session_terminated - SessionService.create_from_params at services/session/service.py:255 - SessionRow.creation_id at models/session/row.py:389-390 Drop redundant tables (Decision Log, Airflow-mapping) and the Open Questions section to match the format used by recent BEPs (BEP-1049, BEP-1050). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the Django-style select_for_update() reference with SQLAlchemy 2.x syntax matching the existing pattern in repositories/agent/db_source.py and repositories/deployment/db_source.py: sa.select(SessionRow).where(...) .with_for_update() inside begin_session(). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address production-readiness review findings: - Add partial unique index on (parent_session_id, retry_count) as the real idempotency guarantee for retry dispatch. The parent row lock alone is insufficient because creation_id has unique=False (models/session/row.py:390); under concurrent handlers the second INSERT would have succeeded silently. The unique index makes duplicate child creation a hard failure. - Commit to extending SessionEventHandler in event_dispatcher/handlers/ session.py for the retry decision, instead of leaving "two equivalent integration points." Sokovan post-processors run during scheduling iterations, which complicates idempotency without adding capability. Also notes the recent sokovan refactor (#11250 / 8321c79) so future readers see why the post-processor path was rejected. - Specifically name BackgroundTaskManager.start_retriable() (already injected into SessionService at service.py:245,408) as the dispatch primitive instead of vague "background task / event mechanism." - Defer project/domain default layer to a follow-up after BEP-1052 (Scoped App Config Redesign) lands, so this BEP doesn't conflict with in-flight config-surface work. - Document the retry handler's own failure modes (classify_failure raise, start_retriable enqueue failure) and the accounting policy (each retry counts against quota; no refund). - Clarify retriable_statuses() is unrelated (in-session re-dispatch, not session re-creation), point DTO location reference to the manager data/CLAUDE.md rule, mark retry_state as an API-layer resolver. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address second-pass review: - Make the BATCH session-type guard explicit at step 1 of the decision flow. SessionEventHandler is shared across BATCH / INTERACTIVE / INFERENCE; without the guard, handle_session_failure would fire for INFERENCE failures too (the same handler already has INFERENCE-specific routing logic at line 210), violating the stated v1 scope. - Pin the status_data["error"] contract to manager/exceptions.py: convert_to_status_data and the ErrorStatusInfo/ErrorDetail TypedDicts (line 97). classify_failure reads error.name and error.src to map to a RetryEligibleCause; without this pin, the classifier would depend on an undocumented shape. Reviewer also confirmed no duplication with pre-existing features (RestartTracker, scheduler retriable_statuses, BEP-1049 deployment retry) -- all PARALLEL, none cover session-level retry of BATCH. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three real bugs from third-pass review confirmed against the code: 1) BackgroundTaskManager.start_retriable (common/bgtask/bgtask.py:444) does NOT accept a delay parameter -- it fires immediately via asyncio.create_task. The "retriable" name refers to the task body retrying on failure, not delayed scheduling. The BEP picked this primitive based on its name alone. Replace with a durable session_retry_dispatch_queue table + periodic claim worker (outbox pattern). Survives manager restarts; idempotency matches the sessions-table partial unique index. 2) Adding a sibling handle_session_failure method on SessionEventHandler would have created a second handler on SessionFailureAnycastEvent (already subscribed by handle_batch_result at dispatch.py:520), racing against existing bookkeeping with undefined ordering. Fold the retry decision INTO handle_batch_result instead, in the SessionFailureAnycastEvent arm. 3) The "kill switch via cluster default" was contradictory: per-session policy wins on merge, so any user setting max_retries:N bypassed it. Split into two separate etcd keys: retry_policy_default (a default, merged) and retry_disabled (boolean, checked at the top of the decision flow before merge -- a true kill switch). Also: drop the creation_id retry-suffix idea entirely (creation_id is String(32) and could overflow). The partial unique index on (parent_session_id, retry_count) is the only idempotency boundary now; creation_id stays a per-attempt random token. Implementation Plan grew from 6 to 7 PRs (queue + dispatcher worker is its own PR); estimate bumped from 3-4 to 4-5 weeks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concrete implementation gaps from the latest hostile review:
1) Kill switch read pattern: changed from "read etcd at top of decision
flow" (hot-path etcd read on every batch failure) to "loaded at
startup, refreshed via existing EtcdConfigWatcher (config/provider.py:
20)." Also extends the kill-switch check to the dispatcher worker
before claiming a queue row, so flipping the switch mid-incident
halts in-flight queued retries.
2) Queue claim deadlock: pinned the SQL to single-row claim using
"FOR UPDATE SKIP LOCKED LIMIT 1" inside an UPDATE-from-SELECT.
Multiple manager replicas can now claim disjoint rows without
contending on the same lock. Sentinel value for failed dispatch
('1970-01-01' timestamptz) made explicit.
3) classify_failure malformed-input fallback: explicitly does NOT
default to UNKNOWN when status_data is missing or has missing
required keys; instead returns a never-retriable sentinel and logs
a WARNING. Only well-formed failures with unrecognized error.name
map to UNKNOWN. Prevents retry storms from serialization bugs.
Also: dispatcher worker now has a concrete proposed location
(sokovan/scheduler/retry_dispatcher.py).
Reviewer also confirmed (fourth pass) no duplication with existing
queue/outbox patterns: SessionDependencyRow exists but is for kernel
deps, not scheduling.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewer feedback (paraphrased):
- "max_retries-style closed enum on the manager side breaks
extensibility (seen this fail before with hardcoded runtime
classification)."
- "Most batch retry should live on the agent."
- "Resource/node-level failures (OOM, disconnect) should be rescheduled
to a different node, not retried in place; don't mutate resource
allocation."
The original BEP-1053 stacked all of this into a single per-session
RetryPolicy + queue + child sessions. Pivot to two narrower BEPs that
ship independently:
BEP-1053 (re-scoped): "Agent-level Batch Retry"
- batch_retries / batch_retry_delay knobs on session creation
- agent re-runs the entrypoint inside the same kernel
- no manager-side state, no new tables, no new events
- ~100 lines, smallest possible delta on Agent.execute_batch
BEP-1054 (new): "Session Rescheduling on Terminal Failure"
- new RescheduleFailedBatchSessionsLifecycleHandler under sokovan
- reuses phase_attempts (no new counter), SERVICE_MAX_RETRIES
(now made configurable per scaling group, closes its FIXME)
- extends the existing expired -> PENDING transition pattern to
fire from terminal-failure with a node-level cause
- failure classification is etcd pattern config (extensible),
not a closed enum in code
- same SessionRow, same allocation; no parent_session_id, no
child sessions, no queue table
The two BEPs compose: agent-side script retries first; if all attempts
fail and the cause is node-level, the scheduler reschedules to a fresh
node; on the new node, agent-side retries run again. Each attempt is
recorded in scheduling history.
Registry updated, news fragment rewritten. Pivot rationale captured at
docs/investigation/bep-1053-design-pivot.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds BEP-1053, capturing the design for native session-level retry in Backend.AI core.
RetryPolicyPydantic schema modeled on Apache Airflow's parameter surface, with deviations documented.RetryEligibleCauseenum and hardcoded never-retry causes.creation_id.SessionRow(no new history table, no new status).max_retries=0preserves current behavior.Implementation is broken into the remaining sub-issues under the epic.
Refs: #11320 (epic), #11321 (BA-5851) (this sub-issue).
Test plan
BEP-1053-native-session-retry.mdlink).🤖 Generated with Claude Code