feat(BA-6600): return per-requirement selection failures with suggestions#12448
Conversation
…ions Make the batch agent selector evaluate every requirement and report each as a success or a structured failure, instead of aborting on the first unplaceable kernel. - select_agents_for_batch_requirements() now returns BatchAgentSelectionResult (selections + per-requirement failures) and no longer raises on partial failure; real scheduling re-raises an aggregated BatchAgentSelectionFailedError. - Each AgentSelectionError exposes build_suggestion() -> Suggestion (reduce resource/container, change arch, revise designated agents) so a dry-run can surface remediation without parsing messages. - NoCompatibleAgentError carries structured architecture context; InsufficientResourcesError.deficit() computes per-slot shortage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ypes Move the remediation value types (Suggestion, SuggestionKind) into a dedicated selectors/types.py that depends only on common value types, so exceptions.py can return a Suggestion without importing selection logic. The selection result types (RequirementSelectionFailure, BatchAgentSelectionResult) stay in selector.py since they depend on AgentSelection/ResourceRequirements defined there. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
67cd539 to
27f7566
Compare
| def build_suggestion(self) -> Suggestion: | ||
| # The resource group has no candidate agents at all; nothing the caller | ||
| # can adjust on the request itself fixes this. | ||
| return Suggestion(kind=SuggestionKind.NONE) |
There was a problem hiding this comment.
Please add @override.
| if any(isinstance(err, ContainerLimitExceededError) for err in self._agent_errors.values()): | ||
| # No node is short on slots; the blocker is the per-agent container limit. | ||
| return Suggestion(kind=SuggestionKind.REDUCE_CONTAINER) |
There was a problem hiding this comment.
Instead of handling specific errors, should we use a specific base exception?
| @dataclass | ||
| class RequirementSelectionFailure: | ||
| """A requirement that could not be placed, paired with its structured reason.""" | ||
|
|
||
| resource_requirement: ResourceRequirements | ||
| error: AgentSelectionError |
There was a problem hiding this comment.
Would it be better to include the exception itself in the dataclass?
| _required_architecture: str | ||
| _available_architectures: Sequence[str] |
There was a problem hiding this comment.
It appears that the _required_architecture value is saved but not used later on.
| error=e, | ||
| ) | ||
| ) | ||
| continue |
There was a problem hiding this comment.
If you're only dealing with resource_req anyway, it seems like you should be able to tell which resource is in short supply without having to iterate through the data. What behavior were you intending?
There was a problem hiding this comment.
Each candidate node has different remaining slots, so the shortage (requested − remaining) differs per node — it can't be derived from resource_req alone. And when a kernel fails mid-batch, the per-node remaining shifts as earlier kernels' placements are already accumulated, so the excess for a later kernel can differ from an earlier one. We therefore iterate the candidate nodes to pick the one needing the smallest reduction.
…or field Address review feedback: - Add @OverRide to overriding error_code/failed_predicates/build_suggestion methods in the selection exception hierarchy. - Remove NoCompatibleAgentError._required_architecture, which was stored but never read (the message uses the constructor argument directly). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ResourceRequirements is a pure value type (depends only on common types), so relocate it from selector.py to types.py and update all importers. This lets lower-layer modules (e.g. exceptions) reference it without importing selection logic. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename Suggestion -> RemediationHint and drop the kind discriminator: any subset of fields may be populated, plus required_container_reduction. - TrackerCompatibilityError.remediation_hint_contribution() lets each node-level error contribute its partial hint; NoAvailableAgentError merges them (no isinstance branching). - Add RequirementSelectionError base carrying the failed resource_requirement; NoAvailableAgentError / NoCompatibleAgentError extend it. - select_agents_for_batch_requirements() returns list[AgentSelection] and raises BatchAgentSelectionFailedError (public .errors) when any requirement cannot be placed, so a dry-run can catch it for per-kernel feedback. - Remove unused ArchitectureIncompatibleError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
select_agents_for_batch_requirements()evaluate every requirement and returnBatchAgentSelectionResult(successes + per-requirement failures) instead of aborting on the first unplaceable kernel. Real scheduling re-raises an aggregatedBatchAgentSelectionFailedErrorto keep its all-or-nothing semantics.build_suggestion() -> Suggestionto everyAgentSelectionErrorso a dry-run can surface structured remediation (reduce resource/container, change arch, revise designated agents) without parsing messages.NoCompatibleAgentErrorstructured architecture context and addInsufficientResourcesError.deficit()(per-slot shortage);NoAvailableAgentErrorcomputes the node-axis MIN reduction.Test plan
pants test tests/unit/manager/sokovan/scheduler::test_build_suggestion.pycovers deficit, arch/reduce/container/designated/none suggestions, node-axis MINResolves BA-6600