Skip to content

feat(BA-6600): return per-requirement selection failures with suggestions#12448

Merged
fregataa merged 6 commits into
mainfrom
feat/BA-6600-dry-run-selector-types
Jul 2, 2026
Merged

feat(BA-6600): return per-requirement selection failures with suggestions#12448
fregataa merged 6 commits into
mainfrom
feat/BA-6600-dry-run-selector-types

Conversation

@fregataa

@fregataa fregataa commented Jun 29, 2026

Copy link
Copy Markdown
Member

Summary

  • Make select_agents_for_batch_requirements() evaluate every requirement and return BatchAgentSelectionResult (successes + per-requirement failures) instead of aborting on the first unplaceable kernel. Real scheduling re-raises an aggregated BatchAgentSelectionFailedError to keep its all-or-nothing semantics.
  • Add build_suggestion() -> Suggestion to every AgentSelectionError so a dry-run can surface structured remediation (reduce resource/container, change arch, revise designated agents) without parsing messages.
  • Give NoCompatibleAgentError structured architecture context and add InsufficientResourcesError.deficit() (per-slot shortage); NoAvailableAgentError computes the node-axis MIN reduction.

Test plan

  • pants test tests/unit/manager/sokovan/scheduler::
  • New test_build_suggestion.py covers deficit, arch/reduce/container/designated/none suggestions, node-axis MIN
  • Updated selector/provisioner/scheduler tests reflect the new return type

Resolves BA-6600

…ions

Make the batch agent selector evaluate every requirement and report each
as a success or a structured failure, instead of aborting on the first
unplaceable kernel.

- select_agents_for_batch_requirements() now returns BatchAgentSelectionResult
  (selections + per-requirement failures) and no longer raises on partial
  failure; real scheduling re-raises an aggregated BatchAgentSelectionFailedError.
- Each AgentSelectionError exposes build_suggestion() -> Suggestion (reduce
  resource/container, change arch, revise designated agents) so a dry-run can
  surface remediation without parsing messages.
- NoCompatibleAgentError carries structured architecture context;
  InsufficientResourcesError.deficit() computes per-slot shortage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fregataa fregataa requested a review from a team as a code owner June 29, 2026 13:46
@github-actions github-actions Bot added size:XL 500~ LoC comp:manager Related to Manager component labels Jun 29, 2026
@fregataa fregataa marked this pull request as draft June 29, 2026 13:53
…ypes

Move the remediation value types (Suggestion, SuggestionKind) into a
dedicated selectors/types.py that depends only on common value types, so
exceptions.py can return a Suggestion without importing selection logic.
The selection result types (RequirementSelectionFailure,
BatchAgentSelectionResult) stay in selector.py since they depend on
AgentSelection/ResourceRequirements defined there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fregataa fregataa force-pushed the feat/BA-6600-dry-run-selector-types branch from 67cd539 to 27f7566 Compare June 29, 2026 14:21
@fregataa fregataa marked this pull request as ready for review June 29, 2026 14:24
Comment on lines +76 to +79
def build_suggestion(self) -> Suggestion:
# The resource group has no candidate agents at all; nothing the caller
# can adjust on the request itself fixes this.
return Suggestion(kind=SuggestionKind.NONE)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add @override.

Comment on lines +141 to +143
if any(isinstance(err, ContainerLimitExceededError) for err in self._agent_errors.values()):
# No node is short on slots; the blocker is the per-agent container limit.
return Suggestion(kind=SuggestionKind.REDUCE_CONTAINER)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of handling specific errors, should we use a specific base exception?

Comment on lines +119 to +124
@dataclass
class RequirementSelectionFailure:
"""A requirement that could not be placed, paired with its structured reason."""

resource_requirement: ResourceRequirements
error: AgentSelectionError

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better to include the exception itself in the dataclass?

Comment on lines +216 to +217
_required_architecture: str
_available_architectures: Sequence[str]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It appears that the _required_architecture value is saved but not used later on.

error=e,
)
)
continue

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're only dealing with resource_req anyway, it seems like you should be able to tell which resource is in short supply without having to iterate through the data. What behavior were you intending?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each candidate node has different remaining slots, so the shortage (requested − remaining) differs per node — it can't be derived from resource_req alone. And when a kernel fails mid-batch, the per-node remaining shifts as earlier kernels' placements are already accumulated, so the excess for a later kernel can differ from an earlier one. We therefore iterate the candidate nodes to pick the one needing the smallest reduction.

…or field

Address review feedback:
- Add @OverRide to overriding error_code/failed_predicates/build_suggestion
  methods in the selection exception hierarchy.
- Remove NoCompatibleAgentError._required_architecture, which was stored but
  never read (the message uses the constructor argument directly).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fregataa fregataa marked this pull request as draft June 30, 2026 06:26
fregataa and others added 2 commits June 30, 2026 17:22
ResourceRequirements is a pure value type (depends only on common types), so
relocate it from selector.py to types.py and update all importers. This lets
lower-layer modules (e.g. exceptions) reference it without importing selection
logic.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename Suggestion -> RemediationHint and drop the kind discriminator: any
  subset of fields may be populated, plus required_container_reduction.
- TrackerCompatibilityError.remediation_hint_contribution() lets each node-level
  error contribute its partial hint; NoAvailableAgentError merges them (no
  isinstance branching).
- Add RequirementSelectionError base carrying the failed resource_requirement;
  NoAvailableAgentError / NoCompatibleAgentError extend it.
- select_agents_for_batch_requirements() returns list[AgentSelection] and raises
  BatchAgentSelectionFailedError (public .errors) when any requirement cannot be
  placed, so a dry-run can catch it for per-kernel feedback.
- Remove unused ArchitectureIncompatibleError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@fregataa fregataa marked this pull request as ready for review June 30, 2026 10:03
@fregataa fregataa requested a review from a team June 30, 2026 10:04
@fregataa fregataa merged commit 388758a into main Jul 2, 2026
33 checks passed
@fregataa fregataa deleted the feat/BA-6600-dry-run-selector-types branch July 2, 2026 00:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:manager Related to Manager component size:XL 500~ LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants