Skip to content

docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322

Open
rapsealk wants to merge 9 commits intomainfrom
bep/1053-native-session-retry
Open

docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling)#11322
rapsealk wants to merge 9 commits intomainfrom
bep/1053-native-session-retry

Conversation

@rapsealk
Copy link
Copy Markdown
Member

@rapsealk rapsealk commented Apr 27, 2026

Summary

Adds BEP-1053, capturing the design for native session-level retry in Backend.AI core.

  • RetryPolicy Pydantic schema modeled on Apache Airflow's parameter surface, with deviations documented.
  • Failure classification with structural RetryEligibleCause enum and hardcoded never-retry causes.
  • Decision/dispatch flow via a new event handler with idempotent child creation_id.
  • Linked-list retry chain on SessionRow (no new history table, no new status).
  • REST v2 / GraphQL v2 / SDK v2 / CLI v2 surface plan.
  • Migration is additive, backportable per the Alembic README, and default max_retries=0 preserves current behavior.

Implementation is broken into the remaining sub-issues under the epic.

Refs: #11320 (epic), #11321 (BA-5851) (this sub-issue).

Test plan

  • BEP rendered correctly on GitHub.
  • Registry entry resolves (BEP-1053-native-session-retry.md link).
  • No conflicting BEP number with concurrently merged proposals.
  • Reviewers confirm the schema, classification table, and decision log capture the agreed design.

🤖 Generated with Claude Code

Captures the design for adding native session-level retry to Backend.AI
core, modeled after Apache Airflow's RetryPolicy parameter surface and
adapted to Backend.AI's event-driven architecture. Subsequent
implementation work tracked under epic.

Refs: #11320, #11321

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings April 27, 2026 01:18
@rapsealk rapsealk added this to the 26.5 milestone Apr 27, 2026
@github-actions github-actions Bot added the size:L 100~500 LoC label Apr 27, 2026
@rapsealk rapsealk linked an issue Apr 27, 2026 that may be closed by this pull request
@rapsealk rapsealk changed the title docs: BEP-1053 native session retry docs(BA-5851): Add BEP-1053 for native session retry Apr 27, 2026
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new BEP proposal (BEP-1053) documenting the design for native session-level retry in Backend.AI core, and registers the BEP number in the proposals index.

Changes:

  • Reserve BEP number 1053 in the proposals registry.
  • Add proposals/BEP-1053-native-session-retry.md describing retry policy schema, failure classification, dispatch flow, data model, API surface, and rollout plan.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 8 comments.

File Description
proposals/README.md Adds BEP-1053 entry to the BEP registry table.
proposals/BEP-1053-native-session-retry.md Introduces the BEP draft for native session retry design and planned surface/migration.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
Comment thread proposals/BEP-1053-native-session-retry.md Outdated
rapsealk and others added 6 commits April 27, 2026 11:10
Verified code paths against latest main:
- SessionStatus enum at data/session/types.py:30-50; terminal_statuses() at line 109
- SessionEventHandler at event_dispatcher/handlers/session.py:52, with handlers
  for started/cancelled/terminating/terminated; status_data["error"] already
  consulted in handle_session_terminated
- SessionService.create_from_params at services/session/service.py:255
- SessionRow.creation_id at models/session/row.py:389-390

Drop redundant tables (Decision Log, Airflow-mapping) and the Open Questions
section to match the format used by recent BEPs (BEP-1049, BEP-1050).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the Django-style select_for_update() reference with SQLAlchemy 2.x
syntax matching the existing pattern in repositories/agent/db_source.py and
repositories/deployment/db_source.py: sa.select(SessionRow).where(...)
.with_for_update() inside begin_session().

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address production-readiness review findings:

- Add partial unique index on (parent_session_id, retry_count) as the real
  idempotency guarantee for retry dispatch. The parent row lock alone is
  insufficient because creation_id has unique=False
  (models/session/row.py:390); under concurrent handlers the second INSERT
  would have succeeded silently. The unique index makes duplicate child
  creation a hard failure.

- Commit to extending SessionEventHandler in event_dispatcher/handlers/
  session.py for the retry decision, instead of leaving "two equivalent
  integration points." Sokovan post-processors run during scheduling
  iterations, which complicates idempotency without adding capability.
  Also notes the recent sokovan refactor (#11250 / 8321c79) so future
  readers see why the post-processor path was rejected.

- Specifically name BackgroundTaskManager.start_retriable() (already
  injected into SessionService at service.py:245,408) as the dispatch
  primitive instead of vague "background task / event mechanism."

- Defer project/domain default layer to a follow-up after BEP-1052
  (Scoped App Config Redesign) lands, so this BEP doesn't conflict with
  in-flight config-surface work.

- Document the retry handler's own failure modes (classify_failure raise,
  start_retriable enqueue failure) and the accounting policy (each retry
  counts against quota; no refund).

- Clarify retriable_statuses() is unrelated (in-session re-dispatch, not
  session re-creation), point DTO location reference to the manager
  data/CLAUDE.md rule, mark retry_state as an API-layer resolver.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address second-pass review:

- Make the BATCH session-type guard explicit at step 1 of the decision
  flow. SessionEventHandler is shared across BATCH / INTERACTIVE /
  INFERENCE; without the guard, handle_session_failure would fire for
  INFERENCE failures too (the same handler already has INFERENCE-specific
  routing logic at line 210), violating the stated v1 scope.

- Pin the status_data["error"] contract to manager/exceptions.py:
  convert_to_status_data and the ErrorStatusInfo/ErrorDetail TypedDicts
  (line 97). classify_failure reads error.name and error.src to map to a
  RetryEligibleCause; without this pin, the classifier would depend on an
  undocumented shape.

Reviewer also confirmed no duplication with pre-existing features
(RestartTracker, scheduler retriable_statuses, BEP-1049 deployment
retry) -- all PARALLEL, none cover session-level retry of BATCH.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three real bugs from third-pass review confirmed against the code:

1) BackgroundTaskManager.start_retriable (common/bgtask/bgtask.py:444)
   does NOT accept a delay parameter -- it fires immediately via
   asyncio.create_task. The "retriable" name refers to the task body
   retrying on failure, not delayed scheduling. The BEP picked this
   primitive based on its name alone. Replace with a durable
   session_retry_dispatch_queue table + periodic claim worker (outbox
   pattern). Survives manager restarts; idempotency matches the
   sessions-table partial unique index.

2) Adding a sibling handle_session_failure method on SessionEventHandler
   would have created a second handler on SessionFailureAnycastEvent
   (already subscribed by handle_batch_result at dispatch.py:520),
   racing against existing bookkeeping with undefined ordering. Fold
   the retry decision INTO handle_batch_result instead, in the
   SessionFailureAnycastEvent arm.

3) The "kill switch via cluster default" was contradictory: per-session
   policy wins on merge, so any user setting max_retries:N bypassed it.
   Split into two separate etcd keys: retry_policy_default (a default,
   merged) and retry_disabled (boolean, checked at the top of the
   decision flow before merge -- a true kill switch).

Also: drop the creation_id retry-suffix idea entirely (creation_id is
String(32) and could overflow). The partial unique index on
(parent_session_id, retry_count) is the only idempotency boundary now;
creation_id stays a per-attempt random token.

Implementation Plan grew from 6 to 7 PRs (queue + dispatcher worker is
its own PR); estimate bumped from 3-4 to 4-5 weeks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three concrete implementation gaps from the latest hostile review:

1) Kill switch read pattern: changed from "read etcd at top of decision
   flow" (hot-path etcd read on every batch failure) to "loaded at
   startup, refreshed via existing EtcdConfigWatcher (config/provider.py:
   20)." Also extends the kill-switch check to the dispatcher worker
   before claiming a queue row, so flipping the switch mid-incident
   halts in-flight queued retries.

2) Queue claim deadlock: pinned the SQL to single-row claim using
   "FOR UPDATE SKIP LOCKED LIMIT 1" inside an UPDATE-from-SELECT.
   Multiple manager replicas can now claim disjoint rows without
   contending on the same lock. Sentinel value for failed dispatch
   ('1970-01-01' timestamptz) made explicit.

3) classify_failure malformed-input fallback: explicitly does NOT
   default to UNKNOWN when status_data is missing or has missing
   required keys; instead returns a never-retriable sentinel and logs
   a WARNING. Only well-formed failures with unrecognized error.name
   map to UNKNOWN. Prevents retry storms from serialization bugs.

Also: dispatcher worker now has a concrete proposed location
(sokovan/scheduler/retry_dispatcher.py).

Reviewer also confirmed (fourth pass) no duplication with existing
queue/outbox patterns: SessionDependencyRow exists but is for kernel
deps, not scheduling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk requested a review from HyeockJinKim April 27, 2026 03:16
Reviewer feedback (paraphrased):
- "max_retries-style closed enum on the manager side breaks
   extensibility (seen this fail before with hardcoded runtime
   classification)."
- "Most batch retry should live on the agent."
- "Resource/node-level failures (OOM, disconnect) should be rescheduled
   to a different node, not retried in place; don't mutate resource
   allocation."

The original BEP-1053 stacked all of this into a single per-session
RetryPolicy + queue + child sessions. Pivot to two narrower BEPs that
ship independently:

  BEP-1053 (re-scoped): "Agent-level Batch Retry"
    - batch_retries / batch_retry_delay knobs on session creation
    - agent re-runs the entrypoint inside the same kernel
    - no manager-side state, no new tables, no new events
    - ~100 lines, smallest possible delta on Agent.execute_batch

  BEP-1054 (new): "Session Rescheduling on Terminal Failure"
    - new RescheduleFailedBatchSessionsLifecycleHandler under sokovan
    - reuses phase_attempts (no new counter), SERVICE_MAX_RETRIES
      (now made configurable per scaling group, closes its FIXME)
    - extends the existing expired -> PENDING transition pattern to
      fire from terminal-failure with a node-level cause
    - failure classification is etcd pattern config (extensible),
      not a closed enum in code
    - same SessionRow, same allocation; no parent_session_id, no
      child sessions, no queue table

The two BEPs compose: agent-side script retries first; if all attempts
fail and the cause is node-level, the scheduler reschedules to a fresh
node; on the new node, agent-side retries run again. Each attempt is
recorded in scheduling history.

Registry updated, news fragment rewritten. Pivot rationale captured at
docs/investigation/bep-1053-design-pivot.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk changed the title docs(BA-5851): Add BEP-1053 for native session retry docs(BA-5851): Add BEP-1053 (agent batch retry) and BEP-1054 (session rescheduling) Apr 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Author BEP for native session retry

2 participants