Skip to content

fix(BA-5978): convert pydantic validation errors to per-domain 400 errors#11514

Draft
jopemachine wants to merge 14 commits intomainfrom
fix/BA-5978-pydantic-validation-domain-errors
Draft

fix(BA-5978): convert pydantic validation errors to per-domain 400 errors#11514
jopemachine wants to merge 14 commits intomainfrom
fix/BA-5978-pydantic-validation-domain-errors

Conversation

@jopemachine
Copy link
Copy Markdown
Member

@jopemachine jopemachine commented May 8, 2026

Summary

  • Catch pydantic.ValidationError raised at service / business-logic call sites and translate it into per-domain BackendAIError subclasses, so bad user payloads return HTTP 400 with structured field-level errors instead of bubbling up as HTTP 500.
  • Each call site gets its own domain-specific error class with a unique error_type and error_code, so clients can tell at a glance what went wrong.
  • Improve common parsing-layer messages: preserve body / query / header / path location prefixes; keep every field error instead of just the first.

Resolves BA-5978.

Changes

New domain error classes

Class Location Domain / status
InvalidSessionCreationConfig errors/kernel.py SESSION / 400
InvalidResourceOpts errors/kernel.py KERNEL / 400
InvalidScalingGroupOpts errors/resource.py SCALING_GROUP / 400
InvalidManagerConfig errors/resource.py BACKENDAI / 400
PydanticValidationError common/exception.py API / 400
InvalidGraphQLPydanticInput manager/errors/api.py API / 400

Each carries a from_pydantic(exc) classmethod that produces a one-line summary like body.user.email: invalid email; body.age: must be >= 0 plus a structured extra_data.errors list with loc / msg / type per field.

Helper

format_pydantic_validation_errors() in common/exception.py — single helper used by every domain error to keep formatting consistent.

Wrapped service-layer call sites

File Site Domain error
api/rest/session/handler.py _validate_creation_config (V1-V7, 12 sites) InvalidSessionCreationConfig
services/session/service.py :1574 (ResourceOpts) InvalidResourceOpts
services/model_serving/services/model_serving.py :622 (ResourceOpts) InvalidResourceOpts
registry.py :1020 (ResourceOpts) InvalidResourceOpts
services/model_card/service.py :186 (ModelDefinition) ModelCardParseError (existing) — improved structured message
api/gql_legacy/scaling_group.py :683, :728 (ScalingGroupOpts) InvalidScalingGroupOpts
api/gql_legacy/service_config.py :200 (ManagerUnifiedConfig) InvalidManagerConfig

Common parsing-layer message improvements

  • common/api_handlers.py: BodyParam / QueryParam / HeaderParam / PathParam decorators now raise PydanticValidationError with the appropriate location_prefix.
  • manager/api/utils.py: pydantic_params_api_handler keeps every field error instead of just the first.

Out of scope

  • External system response parsing (clients/storage_proxy/*, clients/agent/*, etc.) — not user input; 502/500 is appropriate.
  • Bootstrap config loading, DB row deserialization, internal event serialization — unrelated to user response codes.
  • Blanket wrap of api/gql/pydantic_compat.py to_pydantic() — would funnel every GQL input through one generic error and defeat the per-domain split.

Test plan

  • pants check and pants lint pass on touched files (already verified locally)
  • Existing unit tests under tests/unit/manager/** still pass (verified by pre-push hook)
  • Manual: send a malformed session-creation payload through ./bai and confirm 400 + structured data.errors in the response
  • Manual: send an invalid ScalingGroupOpts via the legacy GQL mutation and confirm 400 + per-field message

🤖 Generated with Claude Code

Catch pydantic.ValidationError at service / business-logic call sites
and translate it into per-domain BackendAIError subclasses so that bad
user payloads return HTTP 400 with structured, human-readable field
errors instead of bubbling up as HTTP 500.

* Add new domain error classes:
  - InvalidSessionCreationConfig, InvalidResourceOpts (errors/kernel.py)
  - InvalidScalingGroupOpts, InvalidManagerConfig (errors/resource.py)
  - PydanticValidationError, InvalidGraphQLPydanticInput (common base)
* Add format_pydantic_validation_errors() helper in common/exception.py
  that builds both a one-line summary ("body.user.email: invalid email;
  body.age: must be >= 0") and a structured per-field errors list.
* Wrap service-layer model_validate() call sites:
  - REST session handler V1-V7 CreationConfig (12 sites)
  - services/session/service.py, services/model_serving, registry.py
    (ResourceOpts)
  - services/model_card/service.py (ModelDefinition)
  - gql_legacy/scaling_group.py (ScalingGroupOpts)
  - gql_legacy/service_config.py (ManagerUnifiedConfig)
* Improve common parsing layer messages:
  - common/api_handlers.py decorators preserve location_prefix
    (body / query / header / path) when raising 400.
  - manager/api/utils.py pydantic_params_api_handler now keeps every
    field error instead of just the first one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component comp:common Related to Common component labels May 8, 2026
Restore the original single-arg decorator shape of
``convert_validation_error`` instead of the location-prefixed factory
introduced in the previous commit. The body still raises the new
``PydanticValidationError`` for the improved 400 response, but the
public signature stays identical to what callers had before.

Drops the ``location_prefix`` argument from the
``manager/api/utils.py`` ``pydantic_params_api_handler`` call sites
accordingly — per-domain ``from_pydantic`` helpers retain their own
default prefixes where they make sense (e.g. ``resource_opts``,
``scheduler_opts``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jopemachine jopemachine changed the title fix: convert pydantic validation errors to per-domain 400 errors fix(BA-5978): convert pydantic validation errors to per-domain 400 errors May 8, 2026
jopemachine and others added 3 commits May 8, 2026 13:31
Promote ``_format_pydantic_loc`` in ``common/exception.py`` to a public
``format_pydantic_loc`` and adopt the bracket notation for integer
indices (``users[0].email``) so the formatting now matches the existing
``session_spec_preparer._format_loc`` convention.

Drop the duplicated static method on ``SessionSpecPreparer`` — the
finalize path now uses the shared helper, keeping the
``IncompleteSessionSpec`` message identical to before while removing
divergent loc-formatting between layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier refactor removed the outer ``try / except ValidationError``
in ``extract_param_value`` on the assumption that all paths now convert
the exception inside their ``from_*`` decorators. That removal was out
of scope for this PR and dropped a defensive safety net for any future
code path that might surface a raw ``ValidationError`` here. Restore
the original wrapping unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore the common-layer pydantic error handling to exactly what it was
on main:

* ``common/api_handlers.py``: ``convert_validation_error`` raises the
  original ``InvalidAPIParameters(repr(e))`` again.
* ``manager/api/utils.py``: ``pydantic_params_api_handler`` keeps the
  pre-existing first-error formatting and ``ex.json()`` extra_data.

Both already returned 400 before this PR, so the message-format change
was not needed for the original goal (per-domain 400 errors at
service-layer ``model_validate`` call sites).

Also drop two now-dead pieces:

* ``PydanticValidationError`` in ``common/exception.py`` — only the
  reverted common-layer code referenced it.
* ``InvalidGraphQLPydanticInput`` in ``manager/errors/api.py`` — never
  raised after we decided not to wrap GQL ``to_pydantic()`` globally.

The per-domain error classes, the service-layer wraps, and the shared
``format_pydantic_loc`` / ``format_pydantic_validation_errors`` helpers
remain untouched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread changes/11514.fix.md
Cover the formatter's contract using a throwaway Pydantic model
(``_DummyUser``) that intentionally fails validation in the shapes the
helper needs to handle:

* ``format_pydantic_loc``: parametrized cases for empty loc, plain
  fields, nested fields, integer indices (bracket notation), and the
  optional ``location_prefix``.
* ``format_pydantic_validation_errors``: single missing field, multiple
  errors joined with ``; ``, nested + list-indexed locs, prefix
  prepended into both summary and structured entries, and input-repr
  truncation past the 200-char cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/unit/common/test_exception.py Outdated
jopemachine and others added 4 commits May 8, 2026 14:28
* Rename ``changes/.fix.md`` to ``changes/11514.fix.md`` now that the
  PR number is known, so the news fragment no longer relies on the
  ``assign-pr-number`` workflow rewrite.
* Introduce a ``_LocCase`` ``NamedTuple`` for the parametrized
  ``format_pydantic_loc`` cases — named fields make each row easier to
  read than positional 3-tuples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR feedback: the format_pydantic_loc parametrized cases are easier to
scan when each row has named fields instead of positional 3-tuples.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each test method now leads with a one- or two-line docstring stating
exactly what behavior it locks in (single error summary, multi-error
join, dotted/bracketed loc rendering, prefix application, input
truncation, empty-payload fallback). Reading the file top-to-bottom
should now make the contract obvious without studying each assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lock the wire contract of one representative ``Invalid*`` class
introduced by this PR — the V1-V7 ``CreationConfig`` wraps in
``api/rest/session/handler.py`` all funnel into this error type, so it
acts as the canonical example of the per-domain mapping.

Two checks:

* ``status_code == 400`` and the error code is
  ``session_create_invalid-parameters`` — i.e. clients can tell "bad
  user payload" apart from generic 500s.
* ``from_pydantic`` produces both a human-readable summary that names
  every offending field under the default ``config`` prefix and a
  structured ``extra_data["errors"]`` list preserving each entry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:XL 500~ LoC and removed size:L 100~500 LoC labels May 8, 2026
jopemachine and others added 2 commits May 8, 2026 15:16
Replace per-call-site try/except wrappers with an explicit
``BackendAIModel`` base that ships a ``bai_validate`` classmethod.
Subclasses opt in by setting ``__bai_error_class__`` and call sites
just call ``Model.bai_validate(payload)`` instead of catching
``ValidationError`` themselves.

Why ``bai_validate`` rather than overriding ``model_validate``:
``ManagerUnifiedConfig`` is reused at bootstrap time by
``config/provider.py``; auto-converting every ``model_validate`` would
surface HTTP-shaped errors in non-HTTP load paths. Keeping
``model_validate`` as plain Pydantic and adding a sibling method
preserves both call patterns explicitly.

Also lift ``from_pydantic`` onto ``BackendAIError`` itself so every
concrete error class gets the same builder for free, and remove the
duplicated implementations from ``InvalidSessionCreationConfig``,
``InvalidResourceOpts``, ``InvalidScalingGroupOpts``, and
``InvalidManagerConfig``.

Migrated models (multiple inheritance with their existing bases
preserved):

* ``ResourceOpts`` (manager/data/session/options.py) → InvalidResourceOpts
* ``ScalingGroupOpts`` (manager/models/scaling_group/row.py) → InvalidScalingGroupOpts
* ``ManagerUnifiedConfig`` (manager/config/unified.py) → InvalidManagerConfig

Call sites simplified to ``Model.bai_validate(...)`` (no try/except):

* services/session/service.py:1574 (ResourceOpts)
* services/model_serving/services/model_serving.py:622 (ResourceOpts)
* registry.py:1020 (ResourceOpts)
* api/gql_legacy/scaling_group.py:683,728 (ScalingGroupOpts)
* api/gql_legacy/service_config.py:200 (ManagerUnifiedConfig)

Tests:

* Add ``TestBackendAIModelBaiValidate`` covering success pass-through,
  domain-error conversion, and the missing-class-var safeguard.
* Update ``TestInvalidSessionCreationConfig`` to assert the shared
  ``from_pydantic`` shape (no per-class default location prefix
  anymore).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the per-domain ``__bai_error_class__`` registration mechanism in
favour of always raising the existing common ``InvalidAPIParameters``
(HTTP 400) from ``ApiPayloadModel.bai_validate``. The split into
``InvalidSessionCreationConfig`` / ``InvalidResourceOpts`` /
``InvalidScalingGroupOpts`` / ``InvalidManagerConfig`` was not earning
its keep — every concrete class shared the same response shape and
HTTP status, and the registration was extra surface area without a
real distinction at the wire level.

Changes:

* Removed ``__bai_error_class__`` ClassVar and the
  ``NotImplementedError`` safeguard from ``ApiPayloadModel``;
  ``bai_validate`` now raises ``InvalidAPIParameters.from_pydantic(e)``
  directly.
* Removed the four per-domain error classes from ``errors/kernel.py``
  and ``errors/resource.py``.
* Stripped the now-unused ``__bai_error_class__`` lines and imports
  from ``ResourceOpts``, ``ScalingGroupOpts``, ``ManagerUnifiedConfig``.
* ``_validate_creation_config`` in ``api/rest/session/handler.py``
  now raises the manager-side ``InvalidAPIParameters`` instead of
  ``InvalidSessionCreationConfig``.
* Reworked the manager error tests around the simplified contract:
  bai_validate raises ``InvalidAPIParameters`` (400) on failure,
  ``model_validate`` keeps stock Pydantic semantics for non-HTTP
  callers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added size:L 100~500 LoC and removed size:XL 500~ LoC labels May 8, 2026
jopemachine and others added 2 commits May 8, 2026 15:50
Drop the sibling ``bai_validate`` method and override
``model_validate`` / ``model_validate_json`` /
``model_validate_strings`` directly on ``ApiPayloadModel`` so call
sites just look like idiomatic Pydantic again. Each entry point wraps
its ``super()`` call to translate ``ValidationError`` into HTTP 400
``InvalidAPIParameters`` carrying the structured per-field error list.

``__init__`` is intentionally NOT overridden — Pydantic v2 routes
nested validation through the compiled ``__pydantic_validator__``
(not the classmethod), and internal default-value construction
(``Model()``, ``Model(field=...)``) keeps stock Pydantic semantics.

Trade-off: ``ManagerUnifiedConfig.model_validate`` is also called at
bootstrap (config/provider.py); a bad config file at startup now
raises ``InvalidAPIParameters`` instead of ``ValidationError``. The
process still crashes the same way — only the exception class shown
in the traceback differs, so this is a cosmetic loss in exchange for
a cleaner API everywhere else.

Call sites simplified — all ``Model.bai_validate(...)`` calls in
session / model_serving / registry / gql_legacy are back to
``Model.model_validate(...)``.

Tests rebuilt around the override contract:

* ``model_validate`` and ``model_validate_json`` raise
  ``InvalidAPIParameters`` (HTTP 400) on bad input.
* ``Model(**kwargs)`` still raises raw ``ValidationError`` to keep
  internal callers working as before.
* A plain ``BaseModel`` (no mixin) keeps stock Pydantic behavior to
  pin the override as fully opt-in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pre-commit ``ruff format`` rewrap left over from the previous
override commit — splits a too-long ``model_validate(..., by_name=True)``
call onto two lines. No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:common Related to Common component comp:manager Related to Manager component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant