Skip to content

feat(BA-5985): scan routes via BatchQuerier with traffic-status filter#11534

Merged
HyeockJinKim merged 19 commits into
mainfrom
feat/BA-5985-health-check-enabled-flag
May 12, 2026
Merged

feat(BA-5985): scan routes via BatchQuerier with traffic-status filter#11534
HyeockJinKim merged 19 commits into
mainfrom
feat/BA-5985-health-check-enabled-flag

Conversation

@fregataa
Copy link
Copy Markdown
Member

@fregataa fregataa commented May 9, 2026

Summary

  • Replaces DeploymentRepository.get_routes_by_statuses with search_route_datas(*, querier: BatchQuerier) so route coordinator scans funnel through the standard execute_batch_querier plumbing.
  • RouteTargetStatuses gains a traffic: list[RouteTrafficStatus] | None axis; handlers that set it constrain the scan to listed traffic statuses, others pass None and the predicate is skipped.
  • Both coordinator call sites (lifecycle handler dispatch and observer) build conditions explicitly. Pagination is NoPagination — bounded scans are left for a follow-up.

Test plan

  • pants check src/ai/backend/manager::
  • pants test tests/unit/manager/sokovan/deployment/route::

Resolves BA-5985


📚 Documentation preview 📚: https://sorna--11534.org.readthedocs.build/en/11534/


📚 Documentation preview 📚: https://sorna-ko--11534.org.readthedocs.build/ko/11534/

Copilot AI review requested due to automatic review settings May 9, 2026 19:53
@github-actions github-actions Bot added size:XL 500~ LoC comp:manager Related to Manager component comp:common Related to Common component require:db-migration Automatically set when alembic migrations are added or updated labels May 9, 2026
fregataa added a commit that referenced this pull request May 9, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR adds revision-level support for deployments that omit service.health_check, allowing routes to bypass probing while still being registered with AppProxy for traffic.

Changes:

  • Denormalizes deployment_revisions.health_check_enabled (computed at insert + backfilled via Alembic) and uses it to gate health probing.
  • Introduces handler-level filtering via RouteHealthCheckFilter, plus RouteTargetStatuses.traffic_status for query filtering.
  • Refactors AppProxy sync to use pre-fetched RouteData (including session + replica info) without re-querying route/kernel connection info.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/ai/backend/manager/sokovan/deployment/route/handlers/terminating.py Adds required health_check_filter() implementation for coordinator-driven gating.
src/ai/backend/manager/sokovan/deployment/route/handlers/service_discovery_sync.py Adds required health_check_filter() implementation for coordinator-driven gating.
src/ai/backend/manager/sokovan/deployment/route/handlers/running.py Adds required health_check_filter() implementation for coordinator-driven gating.
src/ai/backend/manager/sokovan/deployment/route/handlers/route_eviction.py Adds required health_check_filter() implementation for coordinator-driven gating.
src/ai/backend/manager/sokovan/deployment/route/handlers/provisioning.py Adds required health_check_filter() implementation for coordinator-driven gating.
src/ai/backend/manager/sokovan/deployment/route/handlers/observer/health_check.py Ensures observer only probes routes whose revision enables health checks.
src/ai/backend/manager/sokovan/deployment/route/handlers/health_check.py Filters out health-check-disabled revisions from probing handler.
src/ai/backend/manager/sokovan/deployment/route/handlers/base.py Requires handlers to define health_check_filter() contract.
src/ai/backend/manager/sokovan/deployment/route/handlers/appproxy_sync.py Registers health-check-disabled routes to AppProxy and filters by traffic_status.
src/ai/backend/manager/sokovan/deployment/route/executor.py Skips Valkey health records for disabled revisions; refactors AppProxy sync to use cached route+session data.
src/ai/backend/manager/sokovan/deployment/route/coordinator.py Routes lifecycle processing now queries by RouteTargetStatuses + RouteHealthCheckFilter; adds observer-specific query.
src/ai/backend/manager/repositories/deployment/types/endpoint.py Adds RouteSessionData and moves RouteData.session_id behind session_data.
src/ai/backend/manager/repositories/deployment/types/init.py Exports RouteSessionData.
src/ai/backend/manager/repositories/deployment/repository.py Updates repo query API to accept (target, health_check_filter); adds get_routes_for_health_observation().
src/ai/backend/manager/repositories/deployment/db_source/db_source.py Implements new query semantics (traffic + revision flag gating) and attaches session status to RouteData.
src/ai/backend/manager/repositories/deployment/creators/revision.py Computes health_check_enabled on revision insert from model_definition.
src/ai/backend/manager/models/deployment_revision/row.py Adds persisted health_check_enabled column.
src/ai/backend/manager/models/alembic/versions/a1b2c3d4e5f6_add_health_check_enabled_to_deployment_revisions.py Adds/backfills health_check_enabled from JSONB model_definition.
src/ai/backend/manager/data/deployment/types.py Adds RouteHealthCheckFilter and extends RouteTargetStatuses with traffic_status.
src/ai/backend/common/config.py Adds ModelDefinition.is_health_check_enabled().
changes/11534.feature.md Documents behavior change for health-check-omitted deployments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/ai/backend/manager/repositories/deployment/db_source/db_source.py Outdated
Comment thread src/ai/backend/manager/repositories/deployment/types/endpoint.py Outdated
Comment thread src/ai/backend/manager/sokovan/deployment/route/executor.py Outdated
Comment thread src/ai/backend/manager/sokovan/deployment/route/handlers/base.py Outdated
@fregataa fregataa marked this pull request as draft May 9, 2026 20:14
@fregataa fregataa force-pushed the feat/BA-5985-health-check-enabled-flag branch 4 times, most recently from eb148d5 to 88a2f34 Compare May 11, 2026 09:25
@github-actions github-actions Bot added the area:docs Documentations label May 11, 2026
@fregataa fregataa requested a review from a team May 11, 2026 12:32
@fregataa fregataa marked this pull request as ready for review May 11, 2026 12:32
Comment on lines +139 to +149
model_definition: Mapped[ModelDefinition | None] = mapped_column(
"model_definition", PydanticColumn(ModelDefinition), nullable=True
)
# Set at insert time from ``model_definition.is_health_check_enabled()``.
health_check_enabled: Mapped[bool] = mapped_column(
"health_check_enabled",
sa.Boolean(),
nullable=False,
default=False,
server_default=sa.false(),
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should end up in a situation where the same value is accessed from multiple places.

@fregataa fregataa marked this pull request as draft May 11, 2026 16:14
@fregataa fregataa force-pushed the feat/BA-5985-health-check-enabled-flag branch 2 times, most recently from 8d92102 to 600e107 Compare May 11, 2026 18:04
@fregataa fregataa changed the title feat(BA-5985): allow deployments without health_check via deployment_revisions flag feat(BA-5985): allow deployments without service.health_check May 11, 2026
@fregataa fregataa force-pushed the feat/BA-5985-health-check-enabled-flag branch 2 times, most recently from d77cc3c to 61d6a03 Compare May 12, 2026 08:14
@fregataa fregataa marked this pull request as ready for review May 12, 2026 09:28
@fregataa fregataa force-pushed the feat/BA-5985-health-check-enabled-flag branch from 9a238e7 to 4d4e1a9 Compare May 12, 2026 09:38
@fregataa fregataa requested a review from a team May 12, 2026 09:41
fregataa and others added 4 commits May 12, 2026 22:20
…revisions.health_check_enabled flag

Persist a denormalized boolean on deployment_revisions indicating whether
the revision's model_definition declares service.health_check, computed at
insert time from ModelDefinition.is_health_check_enabled() and backfilled
for existing rows. Manager skips HTTP probing entirely for revisions that
omit the block — routes stay NOT_CHECKED for life and still register with
AppProxy so traffic flows.

- Schema + ORM column + Alembic backfill (handles JSON null vs SQL NULL)
- The boolean is the SQL-level filter; the in-memory RouteData carries the
  resolved ModelHealthCheck (or None) joined eagerly from
  deployment_revisions.model_definition, so no second per-revision fetch
  is needed when initialising health records
- Filter framework: RouteHealthCheckFilter dataclass; handler.health_check_filter()
  abstract classmethod; RouteTargetStatuses gains traffic_status
- Observer-dedicated repository entry: get_routes_for_health_observation()
- HealthCheckRouteHandler: skip revisions with health_check disabled
- AppProxySyncRouteHandler: include revisions with health_check disabled
  (HEALTHY OR disabled, plus traffic_status=ACTIVE) so unconfigured routes
  still register
- RouteData.session_data (replaces session_id field; session_id kept as
  property) carries SessionStatus so sync_appproxy can validate live sessions
- sync_appproxy uses input routes' replica_host/replica_port directly (same
  pattern as register_routes_now); removes the second route+kernel re-query
  via fetch_route_connection_infos and the related condition helper
- Skip routes whose session is not RUNNING/CREATING; error on missing
  session_data or replica info

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delRevision

Surface the denormalized boolean already persisted on
``deployment_revisions.health_check_enabled`` through the full read
stack so clients can tell, per revision, whether the manager will run
HTTP health probes for its routes.

- ``ModelRevisionData`` carries the flag (read from ``DeploymentRevisionRow.to_data``)
- ``RevisionNode`` DTO and the deployment adapter pass it upward
- ``ModelRevision`` GQL exposes ``health_check_enabled`` via
  ``gql_added_field`` with ``NEXT_RELEASE_VERSION``

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``TestGetRoutesByStatuses`` with a single ``populated_routes``
fixture that seeds one endpoint, two revisions (one with
``health_check_enabled=True`` carrying a ``/health`` block, one with
``False`` and no ``model_definition``), and six routes spanning the
(lifecycle x health x traffic_status x revision-hc) matrix the sokovan
route handlers branch on.

Each test consumes only ``populated_routes`` (no inline setup) and
asserts which named routes the filter returns. Covers:

- lifecycle / health / traffic_status target filters
- ``health_check_required=True`` and ``=False`` (probe scheduling)
- ``include_health_check_disabled=True`` OR-branch (AppProxy sync)
- ``RouteData.health_check_config`` projection from ``model_definition``
- empty result when nothing matches

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
fregataa and others added 15 commits May 12, 2026 22:20
…dinator

The dedicated repository method duplicated the join/projection of
``get_routes_by_statuses`` to express what amounts to two literal
filter arguments. Call ``get_routes_by_statuses`` directly from
``_process_observer`` instead so the route fetch goes through a single
query path.

- ``RouteHealthObserver`` is fed by
  ``RouteTargetStatuses(lifecycle=[RUNNING], health=list(...))``
  combined with ``RouteHealthCheckFilter(health_check_required=True)``
  — same rows the dropped method returned.
- Removes ``DeploymentRepository.get_routes_for_health_observation()``
  and its underlying db_source query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The revision-level ``health_check_enabled`` flag stopped being a SQL
predicate: ``get_routes_by_statuses`` no longer takes a
``RouteHealthCheckFilter`` and the two handlers that previously relied
on it now narrow the route set in Python from the resolved
``RouteData.health_check_config`` (eager-joined from
``deployment_revisions.model_definition``).

- ``RouteHealthCheckFilter`` dataclass, ``RouteHandler.health_check_filter()``
  abstract classmethod, and every concrete override are removed.
- ``HealthCheckRouteHandler.execute()`` skips routes whose revision
  opted out of ``service.health_check`` — they have no
  ``RouteHealthRecord`` and would otherwise be classified as stale.
- ``AppProxySyncRouteHandler.target_statuses()`` widens ``health`` to
  ``[HEALTHY, NOT_CHECKED]`` so hc-disabled routes (stuck on
  NOT_CHECKED for life) remain candidates, and ``execute()`` filters
  the in-memory set to ``HEALTHY or health_check_config is None``.
- ``RouteHealthObserver`` already filtered by ``health_check_config``
  in memory, so the coordinator's observer feeder drops the
  ``health_check_required=True`` arg with no behavioural change.

The ORM column ``deployment_revisions.health_check_enabled`` still
exists at this point — it is just unread; a follow-up commit drops it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The revision-level boolean is being removed from the read stack: the
information it carries is fully derivable from each revision's
``model_definition`` (``ModelDefinition.is_health_check_enabled()``),
and the SQL filter that consumed the persisted flag is gone.

- Removes ``health_check_enabled`` from ``ModelRevisionData``,
  ``DeploymentRevisionRow.to_data()``, the deployment adapter,
  ``RevisionNode`` Pydantic DTO, and ``ModelRevision`` GQL type.
- Regenerates GraphQL v2 schema and supergraph dumps.

The ORM column and the writer in ``RevisionCreatorSpec`` are still
present at this point — a follow-up commit drops them along with an
Alembic migration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both the SQL filter (commit 1) and the read-stack exposure (commit 2)
are gone; nothing reads or writes this column anymore. Drop it.

- New Alembic migration ``19159046a2a4`` removes the column; its
  downgrade re-adds it and re-runs the original JSONB backfill so a
  manager rollback can keep reading the persisted flag.
- ``DeploymentRevisionRow.health_check_enabled`` ORM column and the
  ``RevisionCreatorSpec`` writer are removed.
- Test fixture stops setting the column on seeded
  ``DeploymentRevisionRow`` rows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous add/drop migration pair cancels out on this branch: the
column is unused by the final code. Deleting both files keeps the
alembic chain net-zero against ``main`` and avoids a no-op migration
pair landing on production.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two unconditional ``.where()`` calls (lifecycle, health_status) read
more naturally as positional ``where(..., ...)`` arguments on the
initial ``sa.select`` chain. SQLAlchemy treats multiple positional
predicates as AND, so this is identical semantically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dataclass field shares its name with ``RoutingRow.traffic_status``
(the SQL column) and ``RouteBatchUpdaterSpec.traffic_status`` (a writer
field) — both unrelated. Shorten to ``traffic`` so the call site reads
as a target filter axis next to ``lifecycle`` and ``health`` instead of
echoing the column name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…the repo

Bring back ``RouteHealthCheckFilter`` as a thin in-memory gate on the
revision's ``health_check_config`` presence, applied by
``DeploymentRepository.get_routes_by_statuses`` after the SQL query
materializes rows. The SQL surface stays a flat
``(lifecycle, health, traffic)`` selector — no widening, no JSON
predicates. Anything more specific (per-status conditional gates) lives
in the calling handler.

- ``RouteHealthCheckFilter`` exposes one knob: ``health_check_required``
  (``True``/``False``/``None``). The repo runs ``_passes_health_check``
  over each row.
- ``RouteHandler.health_check_filter()`` is abstract; every handler
  declares its filter explicitly. Lifecycle handlers return a no-op
  filter; ``HealthCheckRouteHandler`` returns ``health_check_required=True``.
- ``AppProxySyncRouteHandler`` widens ``target.health`` to
  ``[HEALTHY, NOT_CHECKED]`` so revisions that opted out of health_check
  (which stay NOT_CHECKED for life) remain candidates, and rejects the
  ``NOT_CHECKED + hc-enabled`` rows in ``execute`` — those routes are
  waiting for their first probe and must not be announced yet.
- Coordinator threads the handler's filter (and the observer's
  ``health_check_required=True``) into the repo call.
- Repo tests cover the (lifecycle x health x traffic) filter plus the
  ``health_check_required`` True/False post-filter cases. Handler tests
  assert the declared filter and (for AppProxy) the in-memory eligibility
  predicate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ``RouteHealthCheckFilter.health_check_required`` becomes a plain
  ``bool`` (default ``False``): ``False`` skips the filter, ``True``
  keeps only routes whose revision has a resolved health-check block.
  The previous ``Optional[bool]`` carried a third "keep only opt-out
  routes" mode no caller used and whose meaning was easy to misread
  off the field name.
- Drop ``_passes_health_check`` helper; inline the single remaining
  predicate at the call site in ``get_routes_by_statuses``.
- Remove the ``health_check_required=False`` repository test that
  exercised the dropped semantic.
- Strip the ``hc-`` shorthand from AppProxy sync handler comments and
  the repository test docstring — the abbreviation reads as jargon to
  anyone outside this branch.
- Trim verbose docstrings on ``RouteData`` (back to main's one-liner)
  and ``RouteSessionData`` (single sentence, no FK invariant note).
…fetch

``DeploymentRepository.get_routes_by_statuses`` (which carried the
``DeploymentRevisionRow.model_definition`` and ``SessionRow.status``
JOINs and the in-memory ``RouteHealthCheckFilter`` post-filter) is
split into two narrower entry points so handlers carry the cost of the
data they actually need:

- ``search_route_datas(querier: BatchQuerier)`` returns bare
  ``RouteData`` from ``RoutingRow`` alone; the coordinator builds a
  ``BatchQuerier`` from each handler's ``RouteTargetStatuses`` (now
  ``lifecycle``/``health``/optional list ``traffic``) at the call site
  with ``RouteConditions``.
- ``fetch_health_check_configs(revision_ids)`` returns a per-revision
  ``ModelHealthCheck | None`` map, called from
  ``HealthCheckRouteHandler``, ``AppProxySyncRouteHandler``,
  ``RouteHealthObserver``, and ``RouteExecutor.check_running_routes``
  to gate their per-route behaviour.

``RouteHealthCheckFilter``, ``RouteHandler.health_check_filter()`` and
every override are removed; ``RouteSessionData`` is dropped and
``RouteData.session_id`` becomes a direct field again.

In ``check_running_routes``, the ``(RouteData, ModelHealthCheck)`` pairs
are bundled into a private ``_RouteWithHealthCheck`` frozen dataclass
that flows down through ``_ensure_health_records`` and
``_initialize_health_records``; ``_populate_replica_info`` now also
mutates ``replica_host``/``replica_port`` on the in-memory routes so
the newly populated rows fall through to the single Phase-4 record
init. ``sync_appproxy`` self-fetches ``session_statuses`` instead of
relying on the dropped SQL join.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``AppProxySyncRouteHandler`` no longer fetches per-revision health-check
config; the (``HEALTHY`` OR no-probe) eligibility predicate is owned by
``RouteExecutor.sync_appproxy``, which already does the session-status
lookup for the same route set. The handler shrinks to a passthrough and
drops its ``DeploymentRepository`` dependency.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``HealthCheckRouteHandler`` shrinks to a passthrough and drops its
``DeploymentRepository`` dependency. The (``hc is not None``) filter
that keeps the probe loop from misclassifying opt-out routes as stale
now lives at the head of ``RouteExecutor.check_route_health``, alongside
the Valkey record lookup it gates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hc_pairs filter was gating ``routes_missing_replica`` as well, so
revisions that omit ``service.health_check`` never had their
``replica_host``/``replica_port`` populated. ``sync_appproxy`` would
then reject them indefinitely with "no replica connection info to
sync", breaking the BA-5985 promise that opt-out routes start
receiving traffic as soon as they reach RUNNING.

Split the two phases:

- Replica population now runs over all session-verified successes,
  hc-agnostic, since AppProxy registration needs host/port regardless
  of probing policy.
- RouteHealthRecord initialization stays hc-gated — opt-out revisions
  never get probed, so they don't need a Valkey record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agination

The original BA-5985 scope (allow deployments to omit
``service.health_check`` so the manager skips probing and routes
receive traffic immediately at RUNNING) is being moved out of this
ticket. This commit reverts the 17 commits on the branch that delivered
or scaffolded that capability — the ``deployment_revisions.health_check_enabled``
column and its GQL/DTO/adapter exposure, the ``RouteHealthCheckFilter``
variants, all hc-eligibility filtering inside route handlers, the
``RouteTargetStatuses.traffic`` field (added as opt-out scaffolding),
and the BatchQuerier-based per-handler hc-config fetch.

Net tree state equals ``origin/main``. ``changes/11534.feature.md``
is rewritten to describe the new (narrower) scope: a traffic-status
filter axis on route handlers, BatchQuerier adoption in the
coordinator's lifecycle scan, and pagination on that scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lter axis

Replaces the bespoke ``DeploymentRepository.get_routes_by_statuses``
with ``search_route_datas(*, querier: BatchQuerier)`` so the route
coordinator's lifecycle and observer scans funnel through the same
``execute_batch_querier`` plumbing as every other deployment search.
``RouteTargetStatuses`` gains a ``traffic: list[RouteTrafficStatus]
| None`` axis; handlers that set it constrain the scan to the listed
``traffic_status`` values, others pass ``None`` and the filter is
skipped. Both call sites build their conditions explicitly at the
coordinator boundary and use ``NoPagination`` — bounded-scan
pagination is left for a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fregataa fregataa force-pushed the feat/BA-5985-health-check-enabled-flag branch from 4d4e1a9 to b662afd Compare May 12, 2026 13:43
@github-actions github-actions Bot added size:M 30~100 LoC and removed size:XL 500~ LoC labels May 12, 2026
@fregataa fregataa changed the title feat(BA-5985): allow deployments without service.health_check feat(BA-5985): scan routes via BatchQuerier with traffic-status filter May 12, 2026
@HyeockJinKim HyeockJinKim merged commit a79a045 into main May 12, 2026
33 of 35 checks passed
@HyeockJinKim HyeockJinKim deleted the feat/BA-5985-health-check-enabled-flag branch May 12, 2026 14:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:common Related to Common component comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated size:M 30~100 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants