feat(BA-5985): scan routes via BatchQuerier with traffic-status filter#11534
Merged
Conversation
fregataa
added a commit
that referenced
this pull request
May 9, 2026
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR adds revision-level support for deployments that omit service.health_check, allowing routes to bypass probing while still being registered with AppProxy for traffic.
Changes:
- Denormalizes
deployment_revisions.health_check_enabled(computed at insert + backfilled via Alembic) and uses it to gate health probing. - Introduces handler-level filtering via
RouteHealthCheckFilter, plusRouteTargetStatuses.traffic_statusfor query filtering. - Refactors AppProxy sync to use pre-fetched
RouteData(including session + replica info) without re-querying route/kernel connection info.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/ai/backend/manager/sokovan/deployment/route/handlers/terminating.py | Adds required health_check_filter() implementation for coordinator-driven gating. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/service_discovery_sync.py | Adds required health_check_filter() implementation for coordinator-driven gating. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/running.py | Adds required health_check_filter() implementation for coordinator-driven gating. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/route_eviction.py | Adds required health_check_filter() implementation for coordinator-driven gating. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/provisioning.py | Adds required health_check_filter() implementation for coordinator-driven gating. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/observer/health_check.py | Ensures observer only probes routes whose revision enables health checks. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/health_check.py | Filters out health-check-disabled revisions from probing handler. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/base.py | Requires handlers to define health_check_filter() contract. |
| src/ai/backend/manager/sokovan/deployment/route/handlers/appproxy_sync.py | Registers health-check-disabled routes to AppProxy and filters by traffic_status. |
| src/ai/backend/manager/sokovan/deployment/route/executor.py | Skips Valkey health records for disabled revisions; refactors AppProxy sync to use cached route+session data. |
| src/ai/backend/manager/sokovan/deployment/route/coordinator.py | Routes lifecycle processing now queries by RouteTargetStatuses + RouteHealthCheckFilter; adds observer-specific query. |
| src/ai/backend/manager/repositories/deployment/types/endpoint.py | Adds RouteSessionData and moves RouteData.session_id behind session_data. |
| src/ai/backend/manager/repositories/deployment/types/init.py | Exports RouteSessionData. |
| src/ai/backend/manager/repositories/deployment/repository.py | Updates repo query API to accept (target, health_check_filter); adds get_routes_for_health_observation(). |
| src/ai/backend/manager/repositories/deployment/db_source/db_source.py | Implements new query semantics (traffic + revision flag gating) and attaches session status to RouteData. |
| src/ai/backend/manager/repositories/deployment/creators/revision.py | Computes health_check_enabled on revision insert from model_definition. |
| src/ai/backend/manager/models/deployment_revision/row.py | Adds persisted health_check_enabled column. |
| src/ai/backend/manager/models/alembic/versions/a1b2c3d4e5f6_add_health_check_enabled_to_deployment_revisions.py | Adds/backfills health_check_enabled from JSONB model_definition. |
| src/ai/backend/manager/data/deployment/types.py | Adds RouteHealthCheckFilter and extends RouteTargetStatuses with traffic_status. |
| src/ai/backend/common/config.py | Adds ModelDefinition.is_health_check_enabled(). |
| changes/11534.feature.md | Documents behavior change for health-check-omitted deployments. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
eb148d5 to
88a2f34
Compare
Comment on lines
+139
to
+149
| model_definition: Mapped[ModelDefinition | None] = mapped_column( | ||
| "model_definition", PydanticColumn(ModelDefinition), nullable=True | ||
| ) | ||
| # Set at insert time from ``model_definition.is_health_check_enabled()``. | ||
| health_check_enabled: Mapped[bool] = mapped_column( | ||
| "health_check_enabled", | ||
| sa.Boolean(), | ||
| nullable=False, | ||
| default=False, | ||
| server_default=sa.false(), | ||
| ) |
Collaborator
There was a problem hiding this comment.
I don't think we should end up in a situation where the same value is accessed from multiple places.
8d92102 to
600e107
Compare
d77cc3c to
61d6a03
Compare
9a238e7 to
4d4e1a9
Compare
…revisions.health_check_enabled flag Persist a denormalized boolean on deployment_revisions indicating whether the revision's model_definition declares service.health_check, computed at insert time from ModelDefinition.is_health_check_enabled() and backfilled for existing rows. Manager skips HTTP probing entirely for revisions that omit the block — routes stay NOT_CHECKED for life and still register with AppProxy so traffic flows. - Schema + ORM column + Alembic backfill (handles JSON null vs SQL NULL) - The boolean is the SQL-level filter; the in-memory RouteData carries the resolved ModelHealthCheck (or None) joined eagerly from deployment_revisions.model_definition, so no second per-revision fetch is needed when initialising health records - Filter framework: RouteHealthCheckFilter dataclass; handler.health_check_filter() abstract classmethod; RouteTargetStatuses gains traffic_status - Observer-dedicated repository entry: get_routes_for_health_observation() - HealthCheckRouteHandler: skip revisions with health_check disabled - AppProxySyncRouteHandler: include revisions with health_check disabled (HEALTHY OR disabled, plus traffic_status=ACTIVE) so unconfigured routes still register - RouteData.session_data (replaces session_id field; session_id kept as property) carries SessionStatus so sync_appproxy can validate live sessions - sync_appproxy uses input routes' replica_host/replica_port directly (same pattern as register_routes_now); removes the second route+kernel re-query via fetch_route_connection_infos and the related condition helper - Skip routes whose session is not RUNNING/CREATING; error on missing session_data or replica info Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…delRevision Surface the denormalized boolean already persisted on ``deployment_revisions.health_check_enabled`` through the full read stack so clients can tell, per revision, whether the manager will run HTTP health probes for its routes. - ``ModelRevisionData`` carries the flag (read from ``DeploymentRevisionRow.to_data``) - ``RevisionNode`` DTO and the deployment adapter pass it upward - ``ModelRevision`` GQL exposes ``health_check_enabled`` via ``gql_added_field`` with ``NEXT_RELEASE_VERSION`` Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds ``TestGetRoutesByStatuses`` with a single ``populated_routes`` fixture that seeds one endpoint, two revisions (one with ``health_check_enabled=True`` carrying a ``/health`` block, one with ``False`` and no ``model_definition``), and six routes spanning the (lifecycle x health x traffic_status x revision-hc) matrix the sokovan route handlers branch on. Each test consumes only ``populated_routes`` (no inline setup) and asserts which named routes the filter returns. Covers: - lifecycle / health / traffic_status target filters - ``health_check_required=True`` and ``=False`` (probe scheduling) - ``include_health_check_disabled=True`` OR-branch (AppProxy sync) - ``RouteData.health_check_config`` projection from ``model_definition`` - empty result when nothing matches Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: octodog <mu001@lablup.com>
…dinator The dedicated repository method duplicated the join/projection of ``get_routes_by_statuses`` to express what amounts to two literal filter arguments. Call ``get_routes_by_statuses`` directly from ``_process_observer`` instead so the route fetch goes through a single query path. - ``RouteHealthObserver`` is fed by ``RouteTargetStatuses(lifecycle=[RUNNING], health=list(...))`` combined with ``RouteHealthCheckFilter(health_check_required=True)`` — same rows the dropped method returned. - Removes ``DeploymentRepository.get_routes_for_health_observation()`` and its underlying db_source query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The revision-level ``health_check_enabled`` flag stopped being a SQL predicate: ``get_routes_by_statuses`` no longer takes a ``RouteHealthCheckFilter`` and the two handlers that previously relied on it now narrow the route set in Python from the resolved ``RouteData.health_check_config`` (eager-joined from ``deployment_revisions.model_definition``). - ``RouteHealthCheckFilter`` dataclass, ``RouteHandler.health_check_filter()`` abstract classmethod, and every concrete override are removed. - ``HealthCheckRouteHandler.execute()`` skips routes whose revision opted out of ``service.health_check`` — they have no ``RouteHealthRecord`` and would otherwise be classified as stale. - ``AppProxySyncRouteHandler.target_statuses()`` widens ``health`` to ``[HEALTHY, NOT_CHECKED]`` so hc-disabled routes (stuck on NOT_CHECKED for life) remain candidates, and ``execute()`` filters the in-memory set to ``HEALTHY or health_check_config is None``. - ``RouteHealthObserver`` already filtered by ``health_check_config`` in memory, so the coordinator's observer feeder drops the ``health_check_required=True`` arg with no behavioural change. The ORM column ``deployment_revisions.health_check_enabled`` still exists at this point — it is just unread; a follow-up commit drops it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The revision-level boolean is being removed from the read stack: the information it carries is fully derivable from each revision's ``model_definition`` (``ModelDefinition.is_health_check_enabled()``), and the SQL filter that consumed the persisted flag is gone. - Removes ``health_check_enabled`` from ``ModelRevisionData``, ``DeploymentRevisionRow.to_data()``, the deployment adapter, ``RevisionNode`` Pydantic DTO, and ``ModelRevision`` GQL type. - Regenerates GraphQL v2 schema and supergraph dumps. The ORM column and the writer in ``RevisionCreatorSpec`` are still present at this point — a follow-up commit drops them along with an Alembic migration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both the SQL filter (commit 1) and the read-stack exposure (commit 2) are gone; nothing reads or writes this column anymore. Drop it. - New Alembic migration ``19159046a2a4`` removes the column; its downgrade re-adds it and re-runs the original JSONB backfill so a manager rollback can keep reading the persisted flag. - ``DeploymentRevisionRow.health_check_enabled`` ORM column and the ``RevisionCreatorSpec`` writer are removed. - Test fixture stops setting the column on seeded ``DeploymentRevisionRow`` rows. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous add/drop migration pair cancels out on this branch: the column is unused by the final code. Deleting both files keeps the alembic chain net-zero against ``main`` and avoids a no-op migration pair landing on production. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The two unconditional ``.where()`` calls (lifecycle, health_status) read more naturally as positional ``where(..., ...)`` arguments on the initial ``sa.select`` chain. SQLAlchemy treats multiple positional predicates as AND, so this is identical semantically. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dataclass field shares its name with ``RoutingRow.traffic_status`` (the SQL column) and ``RouteBatchUpdaterSpec.traffic_status`` (a writer field) — both unrelated. Shorten to ``traffic`` so the call site reads as a target filter axis next to ``lifecycle`` and ``health`` instead of echoing the column name. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…the repo Bring back ``RouteHealthCheckFilter`` as a thin in-memory gate on the revision's ``health_check_config`` presence, applied by ``DeploymentRepository.get_routes_by_statuses`` after the SQL query materializes rows. The SQL surface stays a flat ``(lifecycle, health, traffic)`` selector — no widening, no JSON predicates. Anything more specific (per-status conditional gates) lives in the calling handler. - ``RouteHealthCheckFilter`` exposes one knob: ``health_check_required`` (``True``/``False``/``None``). The repo runs ``_passes_health_check`` over each row. - ``RouteHandler.health_check_filter()`` is abstract; every handler declares its filter explicitly. Lifecycle handlers return a no-op filter; ``HealthCheckRouteHandler`` returns ``health_check_required=True``. - ``AppProxySyncRouteHandler`` widens ``target.health`` to ``[HEALTHY, NOT_CHECKED]`` so revisions that opted out of health_check (which stay NOT_CHECKED for life) remain candidates, and rejects the ``NOT_CHECKED + hc-enabled`` rows in ``execute`` — those routes are waiting for their first probe and must not be announced yet. - Coordinator threads the handler's filter (and the observer's ``health_check_required=True``) into the repo call. - Repo tests cover the (lifecycle x health x traffic) filter plus the ``health_check_required`` True/False post-filter cases. Handler tests assert the declared filter and (for AppProxy) the in-memory eligibility predicate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- ``RouteHealthCheckFilter.health_check_required`` becomes a plain ``bool`` (default ``False``): ``False`` skips the filter, ``True`` keeps only routes whose revision has a resolved health-check block. The previous ``Optional[bool]`` carried a third "keep only opt-out routes" mode no caller used and whose meaning was easy to misread off the field name. - Drop ``_passes_health_check`` helper; inline the single remaining predicate at the call site in ``get_routes_by_statuses``. - Remove the ``health_check_required=False`` repository test that exercised the dropped semantic. - Strip the ``hc-`` shorthand from AppProxy sync handler comments and the repository test docstring — the abbreviation reads as jargon to anyone outside this branch. - Trim verbose docstrings on ``RouteData`` (back to main's one-liner) and ``RouteSessionData`` (single sentence, no FK invariant note).
…fetch ``DeploymentRepository.get_routes_by_statuses`` (which carried the ``DeploymentRevisionRow.model_definition`` and ``SessionRow.status`` JOINs and the in-memory ``RouteHealthCheckFilter`` post-filter) is split into two narrower entry points so handlers carry the cost of the data they actually need: - ``search_route_datas(querier: BatchQuerier)`` returns bare ``RouteData`` from ``RoutingRow`` alone; the coordinator builds a ``BatchQuerier`` from each handler's ``RouteTargetStatuses`` (now ``lifecycle``/``health``/optional list ``traffic``) at the call site with ``RouteConditions``. - ``fetch_health_check_configs(revision_ids)`` returns a per-revision ``ModelHealthCheck | None`` map, called from ``HealthCheckRouteHandler``, ``AppProxySyncRouteHandler``, ``RouteHealthObserver``, and ``RouteExecutor.check_running_routes`` to gate their per-route behaviour. ``RouteHealthCheckFilter``, ``RouteHandler.health_check_filter()`` and every override are removed; ``RouteSessionData`` is dropped and ``RouteData.session_id`` becomes a direct field again. In ``check_running_routes``, the ``(RouteData, ModelHealthCheck)`` pairs are bundled into a private ``_RouteWithHealthCheck`` frozen dataclass that flows down through ``_ensure_health_records`` and ``_initialize_health_records``; ``_populate_replica_info`` now also mutates ``replica_host``/``replica_port`` on the in-memory routes so the newly populated rows fall through to the single Phase-4 record init. ``sync_appproxy`` self-fetches ``session_statuses`` instead of relying on the dropped SQL join. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``AppProxySyncRouteHandler`` no longer fetches per-revision health-check config; the (``HEALTHY`` OR no-probe) eligibility predicate is owned by ``RouteExecutor.sync_appproxy``, which already does the session-status lookup for the same route set. The handler shrinks to a passthrough and drops its ``DeploymentRepository`` dependency. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``HealthCheckRouteHandler`` shrinks to a passthrough and drops its ``DeploymentRepository`` dependency. The (``hc is not None``) filter that keeps the probe loop from misclassifying opt-out routes as stale now lives at the head of ``RouteExecutor.check_route_health``, alongside the Valkey record lookup it gates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The hc_pairs filter was gating ``routes_missing_replica`` as well, so revisions that omit ``service.health_check`` never had their ``replica_host``/``replica_port`` populated. ``sync_appproxy`` would then reject them indefinitely with "no replica connection info to sync", breaking the BA-5985 promise that opt-out routes start receiving traffic as soon as they reach RUNNING. Split the two phases: - Replica population now runs over all session-verified successes, hc-agnostic, since AppProxy registration needs host/port regardless of probing policy. - RouteHealthRecord initialization stays hc-gated — opt-out revisions never get probed, so they don't need a Valkey record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…agination The original BA-5985 scope (allow deployments to omit ``service.health_check`` so the manager skips probing and routes receive traffic immediately at RUNNING) is being moved out of this ticket. This commit reverts the 17 commits on the branch that delivered or scaffolded that capability — the ``deployment_revisions.health_check_enabled`` column and its GQL/DTO/adapter exposure, the ``RouteHealthCheckFilter`` variants, all hc-eligibility filtering inside route handlers, the ``RouteTargetStatuses.traffic`` field (added as opt-out scaffolding), and the BatchQuerier-based per-handler hc-config fetch. Net tree state equals ``origin/main``. ``changes/11534.feature.md`` is rewritten to describe the new (narrower) scope: a traffic-status filter axis on route handlers, BatchQuerier adoption in the coordinator's lifecycle scan, and pagination on that scan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lter axis Replaces the bespoke ``DeploymentRepository.get_routes_by_statuses`` with ``search_route_datas(*, querier: BatchQuerier)`` so the route coordinator's lifecycle and observer scans funnel through the same ``execute_batch_querier`` plumbing as every other deployment search. ``RouteTargetStatuses`` gains a ``traffic: list[RouteTrafficStatus] | None`` axis; handlers that set it constrain the scan to the listed ``traffic_status`` values, others pass ``None`` and the filter is skipped. Both call sites build their conditions explicitly at the coordinator boundary and use ``NoPagination`` — bounded-scan pagination is left for a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
4d4e1a9 to
b662afd
Compare
HyeockJinKim
approved these changes
May 12, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
DeploymentRepository.get_routes_by_statuseswithsearch_route_datas(*, querier: BatchQuerier)so route coordinator scans funnel through the standardexecute_batch_querierplumbing.RouteTargetStatusesgains atraffic: list[RouteTrafficStatus] | Noneaxis; handlers that set it constrain the scan to listed traffic statuses, others passNoneand the predicate is skipped.NoPagination— bounded scans are left for a follow-up.Test plan
pants check src/ai/backend/manager::pants test tests/unit/manager/sokovan/deployment/route::Resolves BA-5985
📚 Documentation preview 📚: https://sorna--11534.org.readthedocs.build/en/11534/
📚 Documentation preview 📚: https://sorna-ko--11534.org.readthedocs.build/ko/11534/