feat: v1 API overhaul (/v1 namespace, 4 inference verbs, recipe discovery)#103

Merged
marevol merged 46 commits into main from feat/v1-api on May 22, 2026

@marevol (Collaborator) commented May 21, 2026

Summary

Replaces the alpha-era POST /predict/{name} surface with a versioned v1 HTTP API mounted under /v1, exposing four inference verbs (single/batch × user/related), recipe discovery, and lifted health/metrics endpoints.

  • 9 endpoints, AIP-136 colon-verb pattern (Vertex AI-style)
  • Algolia-style batch body ({requests: [...]}) with per-element status for partial failures
  • Artifact / signing / hot-swap / X-API-Key auth unchanged
  • Pre-existing alpha POST /predict/{name} and GET /models removed (alpha → v1 migration table in docs/migration-v1.md)

Endpoints

| Method | Path | Purpose |
|---|---|---|
| POST | /v1/recipes/{name}:recommend | user → items (single) |
| POST | /v1/recipes/{name}:recommend-related | seed items → items (single) |
| POST | /v1/recipes/{name}:batch-recommend | user bulk |
| POST | /v1/recipes/{name}:batch-recommend-related | seed bulk |
| GET | /v1/recipes | list loaded recipes |
| GET | /v1/recipes/{name} | recipe detail (capability advertise) |
| GET | /v1/health | unauthenticated liveness |
| GET | /v1/health/details | authenticated diagnostics |
| GET | /v1/metrics | Prometheus (opt-in) |
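
A minimal sketch of how a client might address one of the inference verbs listed above, using only the standard library. The base URL, the recipe name `demo`, and the API key value are placeholders. `user_id` and `limit` are the request fields mentioned in this PR's commits; other body fields are not shown, and the function name is hypothetical.

```python
import json
import urllib.request


def build_recommend_request(base_url, recipe, user_id, limit, api_key):
    """Build (but do not send) a POST to /v1/recipes/{recipe}:recommend."""
    # The verb is appended to the recipe name with a colon (AIP-136 style).
    url = f"{base_url.rstrip('/')}/v1/recipes/{recipe}:recommend"
    body = json.dumps({"user_id": user_id, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
    )


# Placeholder values; nothing is sent over the network here.
req = build_recommend_request("http://localhost:8080", "demo", "u42", 10, "changeme")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it. The other three verbs differ only in the suffix after the colon and in their body shape.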

Design background

Industry survey (AWS Personalize, Vertex AI Search for Retail, Azure Personalizer, Algolia Recommend, Spotify, Recombee) drove the path-prefix + custom-verb + Algolia-batch hybrid. Full design in docs/specs/2026-05-21-v1-api-overhaul-design.md and 15-task TDD plan in docs/plans/2026-05-21-v1-api-overhaul.md.

Notable decisions

  • Schemas: RecommendRequest/Response, batch variants, RecipeSummary/Detail in new src/recotem/serving/schemas.py. RecommendItem allows extra (metadata passthrough); requests are strict.
  • Error codes: UNKNOWN_USER, UNKNOWN_SEED_ITEMS, RECIPE_UNAVAILABLE, RECIPE_NOT_FOUND, VALIDATION_ERROR.
  • Partial failures: HTTP 200 with per-element status: ok|error; HTTP 503 only when the recipe itself is unavailable.
  • model_version: sha256:<hex> derived from _loaded_marker[1]; surfaced on every recommend response plus X-Recotem-Model-Version header.
  • Batch endpoints intentionally do not join per-item metadata (single endpoints still do). Documented in docs/api-reference.md and CHANGELOG.md.
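
To illustrate the partial-failure contract above, here is a rough client-side sketch that separates successful and failed elements of an HTTP 200 batch response. The per-element `status` and `code` fields come from this PR; the top-level `results` key, the `items` payload shape, and the function name are assumptions made for the example.

```python
def split_batch_results(response_body):
    """Separate successful and failed elements of a batch response."""
    ok, failed = [], []
    # Assumption: element order mirrors the order of the submitted requests.
    for index, entry in enumerate(response_body.get("results", [])):
        if entry.get("status") == "ok":
            ok.append((index, entry))
        else:
            # Keep only the machine-readable code for failed elements.
            failed.append((index, entry.get("code", "UNKNOWN")))
    return ok, failed


# Hypothetical response body with one success and one failure.
sample = {
    "results": [
        {"status": "ok", "items": [{"item_id": "i1", "score": 0.9}]},
        {"status": "error", "code": "UNKNOWN_USER"},
    ]
}
ok, failed = split_batch_results(sample)
print(len(ok), failed)
```

A caller would typically retry or report only the indices in `failed`, since the HTTP status alone does not reveal per-element errors.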

Test plan

  • uv run pytest -q tests/unit tests/integration → 1614 passed, 4 deselected
  • uv run ruff check src tests clean; uv run ruff format --check clean
  • Live smoke test with examples/quickstart/ artifact: /v1/health, :recommend, :recommend-related, :batch-recommend, /v1/recipes all return 200 with the expected envelope
  • Final review caught a bug where _try_load_artifact (startup-scan path) didn't populate loaded_at_unix/config_digest/algorithms; fixed in d29b1aa with a regression test
  • Reviewer to gut-check error code naming and the partial-failure contract before merge

Docs updated

  • README.md quickstart
  • docs/getting-started.md
  • docs/operations.md (SLO/metrics table reflects v1 verbs)
  • docs/api-reference.md (new — authoritative endpoint reference)
  • docs/migration-v1.md (new — alpha → v1 mapping)
  • CLAUDE.md (legacy references purged)
  • CHANGELOG.md (new — Unreleased section)

Known follow-ups (not blocking merge)

  • schemas.py:loaded_at: str could become AwareDatetime for OpenAPI-side validation.
  • X-Recotem-Metadata-Degraded header is documented but not currently emitted by v1 endpoints (server-side metric still recorded).
  • Consider renaming serving/routes.py to serving/_metadata_join.py now that only _lookup_metadata lives there.

marevol added 30 commits May 21, 2026 13:08
Captures the design decisions (industry survey of major recommendation APIs,
business motivation, endpoint catalogue, schemas, status codes, metrics) and
the 15-task TDD plan for delivering the v1 surface.
Confirms /v1/recipes/{name}:recommend style paths route and appear
in OpenAPI before refactoring routes.py. Removed in Task 13.
- POC docstring + plan Task 1 file list: Task 13 -> Task 12 (the file
  is removed alongside the legacy router retirement, not the e2e
  conversion task).
- Spec: drop unresolvable recotem-playground cross-reference and the
  bare 'see separate survey report' sentence (replaced with inline
  citation list of the vendor docs that informed the design).
- Spec §9 acceptance criterion: OpenAPI is mounted at /openapi.json,
  not /v1/openapi.json.
Introduces RecommendRequest/Response, batch variants, RecipeSummary, etc.
Used by the upcoming v1 router. No behaviour change yet.
ModelEntry now carries a v1-shaped artifact identifier (sha256:<hex>),
load timestamp (ISO UTC), inference kind, and the list of supported
verbs so the upcoming /v1/recipes endpoint can publish them without
re-reading artifact files. The watcher passes loaded_at_unix /
config_digest / algorithms on every successful (re-)load.
Introduces recotem_v1_requests_total{recipe,verb,status},
recotem_v1_request_latency_seconds{recipe,verb}, and
recotem_v1_batch_size{recipe,verb} histograms. Legacy
record_predict() remains untouched and will be removed in Task 12.
Mirrors the legacy make_router signature so app.py can swap routers
in Task 12. Inference endpoints land in Tasks 7-10.
Adds the first inference endpoint to the v1 router. Returns the new
RecommendResponse envelope (request_id / recipe / model_version / items)
with structured error detail bodies:

- 503 RECIPE_UNAVAILABLE when entry missing or not loaded
- 404 UNKNOWN_USER when the recommender raises KeyError
- 422 from Pydantic validation (e.g. empty user_id)

The metadata join reuses the legacy _lookup_metadata helper from
routes.py (kept until Task 12 retires make_router) and the metrics
hook routes through _metrics.record_v1_request(name, "recommend", ...).

Also updates the Task 5 skeleton test to assert against a verb the
router does not define, so it remains valid as inference routes land.
- app.py now wires make_v1_router(...) at prefix=/v1
- routes.py reduced to the _lookup_metadata helper (still imported by
  v1_router)
- legacy tests/unit/test_serving_routes.py removed
- POC test (test_v1_colon_path_poc.py) removed
- routes-dependency-introspection regression test removed (its
  invariants apply only to the deleted v0 router; v1_router uses
  Annotated[] which is compatible with from __future__ import
  annotations)
- test_serving_app.py and test_cli.py probe paths migrated to /v1/*
  (e.g. /health -> /v1/health, /predict/{x} -> /v1/recipes/{x}:recommend,
  /models -> /v1/recipes); make_router monkey-patch references switched
  to make_v1_router
- test_v1_router_basics.py: stale 404-on-undefined-verb assertion
  rewritten to probe a verb that does not exist as a route
- metadata/loader.py docstring xref updated to point at make_v1_router
Replaces /predict/{name} calls with /v1/recipes/{name}:recommend and
adds a :recommend-related coverage case using the existing quickstart
artifact.
…ion-v1

Aligns published documentation with the v1 API surface.
_try_load_artifact constructed ModelEntry without loaded_at_unix /
config_digest / algorithms, so GET /v1/recipes reported
loaded_at='1970-01-01T00:00:00Z' for every recipe at startup until a
hot-swap occurred. Mirrors the watcher._build_entry fix from Task 3.

Also clarifies the X-Recotem-Metadata-Degraded doc bullet to drop the
'legacy code paths' wording (legacy paths were retired in Task 12).
Health probe and recommend call were still hitting /health and
/predict/{name}; v1 mounts them under /v1/health and
/v1/recipes/{name}:recommend with a `limit` field and a flat response
schema. CI e2e job was timing out at the health-wait loop.
- RequestIDMiddleware: echo X-Request-ID on every response (including
  HTTPException and unhandled-error paths) and bind structlog
  contextvars so downstream log lines carry request_id without each
  handler having to pass it. Client-supplied IDs are validated to a
  short charset and replaced with a server-generated one when missing
  or malformed.
- Split RECIPE_NOT_FOUND (404) vs RECIPE_UNAVAILABLE (503) in
  recipe_detail and the four inference handlers; previously a stub
  registry entry (loaded=False) was indistinguishable from an unknown
  recipe, breaking the retry contract documented in api-reference.md.
- Drop the orphan recotem_predict_total / recotem_predict_latency_seconds
  metrics that lingered after /predict/{name} was retired. Inventory
  docstring now lists the v1 metrics, and CHANGELOG + migration-v1
  call out the removed metric names so dashboards/alerts can be
  retargeted at recotem_v1_requests_total /
  recotem_v1_request_latency_seconds.
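
The request-ID rule described in the first bullet above (accept a well-formed client ID, otherwise generate one) could be sketched as follows. The charset and the 64-character bound are assumptions for illustration; the pattern actually used by `RequestIDMiddleware` is not shown in this PR text, and `resolve_request_id` is a hypothetical name.

```python
import re
import uuid

# Assumed charset and length bound, for illustration only.
REQUEST_ID_RE = re.compile(r"[A-Za-z0-9._-]{1,64}")


def resolve_request_id(header_value):
    """Return the client-supplied ID if well-formed, else a fresh server ID."""
    if header_value is not None and REQUEST_ID_RE.fullmatch(header_value):
        return header_value
    # Missing or malformed: never echo untrusted input back to the client.
    return uuid.uuid4().hex


print(resolve_request_id("client-abc.123"))
print(len(resolve_request_id("bad id\r\nInjected: 1")))
```

Using `fullmatch` rather than `match` matters here, because a trailing newline or other suffix would otherwise slip through and end up in a response header.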
These directories hold session-scoped planning and design notes that
should not ship in the repository. Add them to .gitignore so future
edits do not accidentally re-add them.
Drop the legacy ``v1_router.py`` / ``routes.py`` split. ``routes.py`` is
now the single router module, ``make_v1_router`` is renamed to
``make_router``, and the ``_lookup_metadata`` helper is inlined.  The
``/v1`` URL prefix is still applied at mount time via
``app.include_router(..., prefix="/v1")``.

Also folds in expanded test coverage that was in flight:
``X-Recotem-Model-Version`` response-header round-trip, recipe-name path
regex enforcement, schema field round-trips, dict / DataFrame metadata
enrichment paths, and additional e2e scenarios.
The scheduled nightly suite re-downloaded MovieLens100K from
files.grouplens.org on every run via irspack's MovieLens100KDataManager,
making green CI dependent on an external server that intermittently
times out (run 26212149650 errored at fixture setup with Errno 110).
Drop the workflow entirely; the MovieLens-backed slow tests remain
available via `pytest -m slow` for local runs.
Alpha→v1 migration page is unnecessary for this PR's scope; the v1
API replaces the alpha endpoints outright and the README/operations
docs already cover the current shape.
Address review of PR #103 v1 API overhaul: restore observability
parity with the legacy /predict handler and rationalise the error body
shape across the v1 surface.

Code:
- routes.py: read request_id from request.state (set by
  RequestIDMiddleware) instead of re-parsing the header per handler,
  resolving a body/header split-brain for 65-128 char IDs. Bind
  recipe/kid via structlog.contextvars for the duration of each
  inference handler and unbind in finally. Emit recipe_unavailable
  WARN on every 404 RECIPE_NOT_FOUND / 503 RECIPE_UNAVAILABLE. Add
  (MemoryError, RecursionError) fast-path before the generic Exception
  branch so OOM does not run through logger.exception. Capture
  exc and include error_class on unexpected-error logs.
- app.py: new HTTPException handler flattens dict-shaped details to
  {detail, code} top-level so the body is no longer double-nested.
  Defensive setdefault on detail using _DEFAULT_DETAIL_FOR. New
  RequestValidationError handler returns
  {request_id, detail, code: VALIDATION_ERROR, errors} and records
  recotem_v1_requests_total{status=validation_error} when the path
  matches a v1 inference verb. _unhandled_exception_handler now
  attaches X-Request-ID from request.state because
  ServerErrorMiddleware wraps outside RequestIDMiddleware.

Docs:
- api-reference.md: align X-Request-ID regex to {1,128}, drop 403
  from status code lists, add RECIPE_NOT_FOUND to per-endpoint codes,
  document flat error body envelope and 422 / 500 shapes.
- operations.md: rewrite error body samples from {"error": {...}} to
  the actual flat {"detail", "code"} shape, add 422 section.
- security.md: update trust boundary diagram, inference response
  section, and nginx rate-limit example to /v1/* (zone recotem_v1).
- Remove stale /predict references from registry.py, app.py,
  metrics.py (Prometheus help text), recipe/models.py,
  metadata/loader.py, CONTRIBUTING.md, compose.yaml, README and
  example recipes.

Tests:
- Update v1 unit tests for the flat error envelope.
- Add tests/conftest.py build_v1_app helper that mirrors create_app
  wiring (RequestIDMiddleware + all three exception handlers).
- New tests/unit/test_v1_error_handling.py covers X-Request-ID
  consistency, flat error body shape across 401/404/422/500/503,
  request_id correlation, 405 method-not-allowed, FastAPI auto-404,
  path-param validation, contextvars cleanup, and exception-handler
  parity between create_app and build_v1_app.

Full test suite: 1716 passed, 4 deselected. ruff check + format clean.
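
As a rough illustration of the flat error envelope described above, the sketch below lifts a dict-shaped exception detail to the top level and fills in a default message. It is framework-free on purpose; the fallback messages, the treatment of string details, and the function name are assumptions rather than the handler in `app.py`.

```python
# Hypothetical fallback messages keyed by HTTP status code.
DEFAULT_DETAIL_FOR = {404: "Not found", 503: "Service unavailable"}


def flatten_error_body(status_code, detail):
    """Turn an exception detail into a flat {detail, code} style body."""
    if isinstance(detail, dict):
        body = dict(detail)  # copy so the exception's payload is not mutated
        # Defensive default when the raiser supplied only a code.
        body.setdefault("detail", DEFAULT_DETAIL_FOR.get(status_code, "Error"))
        return body
    # Assumption: plain-string details carry no machine-readable code.
    return {"detail": str(detail)}


nested = {"code": "RECIPE_UNAVAILABLE"}
flat = flatten_error_body(503, nested)
print(flat)
```

Without this step, a framework that wraps the detail under its own `detail` key would produce a double-nested body, which is the shape the commit above removes.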
…rity, signals)

Resolves the blockers and important issues from the multi-agent review of
the v1 API overhaul:

- Auth: require X-API-Key for /v1/metrics (was unauthenticated).
- Schemas: extra="forbid" on every request/response model except RecommendItem
  (metadata passthrough). Per-element max_length on seed_items/exclude_items.
  Aggregate batch-cap validator (sum of limits <= 5000). kind/supported_verbs
  as Literal. loaded_at validated as ISO-8601 UTC. score allow_inf_nan=False.
  RecipeDetailResponse no longer inherits from RecipeSummary.
- exclude_items: wired into :recommend and :recommend-related as a
  post-filter (was accepted but silently ignored).
- Typed responses: handlers return RecommendResponse/BatchRecommendResponse
  instead of JSONResponse, with X-Recotem-Model-Version set on the response
  object.
- Batch error handling: batch-recommend-related now has per-element
  try/except parity with batch-recommend; both broadened to catch generic
  Exception (MemoryError/RecursionError re-raised), so one bad element no
  longer 500s the whole batch.
- Error message hygiene: stop echoing user_id/seed_items in error bodies;
  rely on the machine-readable code field (mitigates membership-oracle).
- Metadata-Degraded signal: _lookup_metadata returns (fields, degraded);
  single-verb handlers set X-Recotem-Metadata-Degraded: 1 when join fails.
- /v1/recipes/{name} detail restores trained_at, best_class, best_params,
  best_score, metric, cutoff, tuning, data_stats, recotem_version,
  irspack_version, recipe_hash (previously dropped from GET /models).
- Handler refactor: _resolve_entry helper + _request_metrics context
  manager removes ~150 lines of duplicated prelude across the four verbs.
- 500 handler body now includes request_id (symmetric with 422 handler).
- 422 handler strips raw input/ctx from error dicts (prevents echo of
  client-supplied secrets).
- Path-param regex tightened to require alphanumeric first char.
- _REQUEST_ID_RE narrowed from 128 to 64 chars to match docs.
- last_load_error sanitized (URI redaction + 200-char cap) before storage
  in the registry / details endpoint.
- _lookup_metadata warning log truncates item_id and rate-limits to 10
  per (recipe, kind) tuple to prevent log flooding.
- health() simplified: dead second loop removed.
- watcher sidecar handling: TypeError/OSError no longer silently swallowed.
- registry.models_dict docstring updated (legacy /models endpoint removed).
- Posture warning loop cadence: 5 min outside test env.
- Test fixture: defensively unregister v1 Prometheus collectors by name
  to survive cross-test state leakage.

CI:
- secrets-in-logs grep: exclude public model_version field and
  X-Recotem-Model-Version header from sha256:<hex64> false positives.
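
Two of the schema and handler rules above, the aggregate batch cap and the `exclude_items` post-filter, can be sketched in plain Python. The 5000 cap is taken from the commit message; the dict-based request shape, the `(item_id, score)` tuples, and both function names are assumptions. Whether the server over-fetches candidates before filtering is not stated, so this sketch simply filters and truncates.

```python
MAX_TOTAL_LIMIT = 5000  # aggregate cap named in the commit message


def check_batch_cap(requests, cap=MAX_TOTAL_LIMIT):
    """Reject a batch whose summed per-request limits exceed the cap."""
    total = sum(r["limit"] for r in requests)
    if total > cap:
        raise ValueError(f"sum of limits {total} exceeds cap {cap}")
    return total


def apply_exclude_items(ranked, exclude_items, limit):
    """Drop excluded IDs from a ranked list, then truncate to the limit."""
    excluded = set(exclude_items)
    kept = [(item_id, score) for item_id, score in ranked if item_id not in excluded]
    return kept[:limit]


ranked = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
print(check_batch_cap([{"limit": 10}, {"limit": 20}]))
print(apply_exclude_items(ranked, ["b"], limit=2))
```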
…d 422 sanitization

Add tests covering auth requirements on related/batch endpoints, request_id
echoing in error envelopes, 422 error dict sanitization (input/ctx stripped),
path regex leading-char rejection, and kid contextvar log binding. Update
conftest's build_v1_app to mirror the production 422 sanitization so tests
exercise the real response shape.
…alidation

Round of v1 API maturation based on review feedback:

- Add NO_CANDIDATES (404) to distinguish ranker survival failure from
  UNKNOWN_SEED_ITEMS in :recommend-related and per-element batch.
- Switch unknown-recipe response from 503 recipe_unavailable to 404
  RECIPE_NOT_FOUND; clients should treat 404 as hard fail vs 503 retry.
- Uppercase all error codes (MISSING_API_KEY / INVALID_API_KEY /
  INTERNAL_ERROR / VALIDATION_ERROR) for consistency.
- Validate batch sub-requests per-element (bad ones surface as
  status=error code=VALIDATION_ERROR); aggregate sum(limit)<=5000 cap
  now also enforced per-element.
- Expand recotem_v1_requests_total status enum and add
  recotem_v1_batch_element_errors_total{recipe,verb,code}; add reason
  label on recotem_artifact_load_failures_total so HMAC failures (a
  security signal) are alertable independently.
- Drop dead X-Recotem-Metadata-Degraded code path (unreachable since
  metadata_index is populated at every load).
- Relax X-Request-ID echo regex to {1,128} and /v1/recipes/{name} path
  regex to ^[A-Za-z0-9_-]{1,64}$ to match recipe-loader constraints.
- Log batch per-element failures with logger.exception and the actual
  exc_type; raise startup HMAC verify failures to ERROR with exc_info.
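
The 404-versus-503 contract above (404 is a hard failure, 503 may be retried) suggests a client loop along these lines. This is a sketch under assumptions: `send` is a hypothetical callable returning `(status, body)`, and the attempt count and linear backoff are illustrative choices, not values from this PR.

```python
import time


def call_with_retry(send, max_attempts=3, backoff_seconds=0.5, sleep=time.sleep):
    """Retry only on 503; return any other status immediately."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        # 404 (e.g. RECIPE_NOT_FOUND) and everything else is returned as-is.
        if status != 503 or attempt == max_attempts:
            return status, body
        sleep(backoff_seconds * attempt)  # simple linear backoff


# Demo with a fake transport that becomes available on the third call.
responses = iter([(503, {}), (503, {}), (200, {"items": []})])
result = call_with_retry(lambda: next(responses), sleep=lambda _s: None)
print(result)
```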
Silent failures & observability
- _any_seed_known + user_known AttributeError: log + recotem_recommender_layout_unexpected_total + INTERNAL_ERROR propagation
- KeyError mis-attribution fixed: pre-check membership; unexpected KeyError -> INTERNAL_ERROR via logger.exception
- _unhandled_500 / validation_failed now structured-log with sanitized errors and request_id
- Auth bypass log carries mode (insecure_no_auth vs loopback_no_keys)
- Watcher: recotem_watcher_state_divergence_total, dir_scan failure reason, sidecar_disappeared transition warning
- inc_metadata_lookup_error wired via on_row_error callback in build_metadata_index (serving/metadata layering preserved)

Schema & type design
- BatchResultEntry as discriminated union (_BatchResultOk | _BatchResultErr) removes anti-pattern
- Sha256Hex / HexHash branded types applied to model_version, config_digest, recipe_hash (None for stub entries)
- ModelEntry.loaded_at returns tz-aware datetime; loaded_at/trained_at typed as AwareDatetime
- RecipeDetailResponse: metric Literal, cutoff Field(ge=1), version Field(pattern=...)
- RecommendRequest.context dropped (restores extra=forbid integrity)
- RecommendItem.item_id bounded matching _ItemStr

Dead code & convention
- inc_metadata_lookup_error / metadata_field_deny removed/wired
- _RecipeName Annotated alias -> name: str = Path(pattern=...) per CLAUDE.md
- _batch_error_entry typed against ErrorCode; drop type-ignore
- _emit_security_posture: logger.exception spec alignment

API contract
- include_metadata: bool = False on batch requests restores batch <-> single shape parity (opt-in)
- batch_user_known re-initialized per loop iteration

Tests +178 (1614 -> 1792): discriminated union, branded types, AwareDatetime, include_metadata opt-in, KeyError attribution, auth bypass mode, watcher dir_scan/sidecar_disappeared, structured 500/422 logs, batch-recommend-related empty list 422, HTTP-layer concurrent hot-swap, recipe_not_found across all 4 verbs.
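
The KeyError mis-attribution fix above can be pictured roughly as follows: check membership first, so that a `KeyError` raised later is treated as an internal fault rather than an unknown user. The `user_id_to_index` mapping name appears in this PR; the `score_fn` callable, the return shape, and the function name are assumptions for the sketch.

```python
def recommend_for_user(user_id, user_id_to_index, score_fn):
    """Attribute errors correctly by pre-checking membership."""
    if user_id not in user_id_to_index:
        return {"status": "error", "code": "UNKNOWN_USER"}
    try:
        items = score_fn(user_id_to_index[user_id])
    except KeyError:
        # Past the membership check, a KeyError is a bug, not a missing user.
        return {"status": "error", "code": "INTERNAL_ERROR"}
    return {"status": "ok", "items": items}


def broken_scorer(_index):
    raise KeyError("internal lookup table is missing a row")


index = {"u1": 0}
print(recommend_for_user("u2", index, lambda i: ["i1"]))
print(recommend_for_user("u1", index, broken_scorer))
```

Catching `KeyError` around the whole call and mapping it to `UNKNOWN_USER` would report the second case as a client error, hiding a server-side defect.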
marevol added 16 commits May 22, 2026 13:12
Create docs/migration-v1.md covering endpoint mapping, field renames,
flat error envelope, X-Recotem-Metadata-Degraded removal, metrics
renames (no dual-emit), /v1/metrics auth requirement with Prometheus
scrape-config snippet, batch metadata opt-in, and partial-failure
semantics. Add Migration subsection in CHANGELOG.md linking to the guide.
Add explicit warning in api-reference.md that GET /v1/metrics now
requires X-API-Key (the alpha /metrics was unauthenticated) and
cross-link to the migration guide for the Prometheus scrape-config
snippet.
- Add _sanitize_validation_errors() and _format_batch_validation_message()
  helpers so both batch handlers share the same logic.
- Both batch_recommend and batch_recommend_related now build a human-readable
  loc+msg string from exc.errors()[0] and log sanitized error details (loc,
  msg, type only — no user input) at WARNING level.
- Rename _BatchResultOk/_BatchResultErr to BatchResultOk/BatchResultErr
  (M4: drop underscore prefix since they appear in the public OpenAPI schema).
- Update all tests to use the new public names and assert that VALIDATION_ERROR
  messages contain the violating field name.
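
A minimal sketch of the sanitization idea above: keep only `loc`, `msg`, and `type` from each validation error, and build a short message from the first one. The error-dict keys follow the shape Pydantic v2 reports; the function names and the fallback message are hypothetical, not the `_sanitize_validation_errors` helper itself.

```python
SAFE_KEYS = ("loc", "msg", "type")


def sanitize_validation_errors(errors):
    """Drop input/ctx (and anything else) that could echo client data."""
    return [{k: e[k] for k in SAFE_KEYS if k in e} for e in errors]


def format_first_error(errors):
    """Build a 'field.path: message' string from the first error."""
    if not errors:
        return "validation failed"
    first = errors[0]
    loc = ".".join(str(part) for part in first.get("loc", ()))
    return f"{loc}: {first.get('msg', '')}"


raw_errors = [
    {
        "type": "string_too_short",
        "loc": ("user_id",),
        "msg": "String should have at least 1 character",
        "input": "",
        "ctx": {"min_length": 1},
    }
]
print(sanitize_validation_errors(raw_errors))
print(format_first_error(raw_errors))
```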
Add a callout in operations.md (structured-log events section) explaining
that the alpha X-Recotem-Metadata-Degraded per-response header is gone
in v1 and directing operators to recotem_metadata_lookup_errors_total
for load-time metadata join failures.
- Add "timeout" to _LOAD_FAILURE_REASONS in metrics.py (stat hangs in the
  executor are distinct from read errors — infrastructure vs. data signal).
- Stat-timeout path in _poll_artifacts now passes reason="timeout".
- Both _read_artifact_bytes failure paths in _load_recipe now pass
  reason="read" (previously defaulted to "unexpected").
- Hoist `import errno` to module top in watcher.py (M8: was lazy import
  inside _check_sidecar_changed with noqa suppression).
Include exc_type=type(exc).__name__ and error=str(exc)[:200] in the
metadata_index_row_error warning so operators can diagnose the root cause
(e.g. AttributeError from a non-unique index, TypeError from a non-string
column) without enabling debug logging or adding instrumentation.
- Add recipe_name parameter to _build_items() so the warning and metric
  can be attributed to the correct recipe.
- Wrap RecommendItem.model_validate() in try/except ValidationError:
  on failure, log a metadata_serialization_failed WARNING with item_id
  and truncated error (no user input), increment inc_metadata_lookup_error,
  and skip the item rather than aborting the entire response.
Add integration test that loads artifact A, swaps in artifact B via
registry.replace_with_marker, then verifies model_version and
X-Recotem-Model-Version header change between calls. Both values must
match sha256:<64 hex>, and the header must equal the body field.
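
The assertion described above (body field and header both match `sha256:<64 hex>` and equal each other) might look like this in isolation. Lowercase hex is an assumption, and the sample digest is computed from placeholder bytes rather than a real artifact.

```python
import hashlib
import re

# Assumes lowercase hex, as produced by hashlib's hexdigest().
MODEL_VERSION_RE = re.compile(r"sha256:[0-9a-f]{64}")


def check_model_version(body, headers):
    """True when the body field is well-formed and the header mirrors it."""
    version = body.get("model_version", "")
    if not MODEL_VERSION_RE.fullmatch(version):
        return False
    return headers.get("X-Recotem-Model-Version") == version


version_a = "sha256:" + hashlib.sha256(b"artifact-a").hexdigest()
version_b = "sha256:" + hashlib.sha256(b"artifact-b").hexdigest()
print(check_model_version({"model_version": version_a},
                          {"X-Recotem-Model-Version": version_a}))
print(version_a != version_b)
```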
Add parametrized tests for all 4 recommend verbs asserting that a
valid-length but wrong X-API-Key returns 401 INVALID_API_KEY (T2).
Add key-rotation tests (T8): configure two keys (old + new), verify
both authenticate with 200 on :recommend, and a third key gets 401.
Add three unit tests verifying that insecure_no_auth=True lets
:recommend, :recommend-related, and :batch-recommend succeed without
an X-API-Key header. Existing dev-bypass tests only hit /v1/health/details;
these cover the actual prediction paths.
Add integration test that starts ArtifactWatcher with a real recipes
directory, verifies the entry exists, deletes the YAML file, waits for
the watcher to process the deletion, then asserts :recommend returns
404 (RECIPE_NOT_FOUND) or 503 (RECIPE_UNAVAILABLE).
Add test_recommend_related_includes_metadata_fields and
test_recommend_related_strips_denied_fields to mirror the existing
:recommend coverage for the related verb. Denied fields applied at
load time must not appear in :recommend-related items.
Add tests verifying that recotem_v1_requests_total counters accumulate
value (one Prometheus line per distinct label-set, not one per request)
and that the recipe_not_found status label is correctly assigned for
RECIPE_NOT_FOUND 404 responses.
Add three tests verifying the per-recipe shape returned by
/v1/health/details: healthy entries include loaded=True, best_class,
trained_at, kid, and no error field; stub entries include loaded=False
and an error string. Two-recipe scenario (1 healthy + 1 stub) confirms
the degraded aggregate status and correct per-recipe fields.
M1: Add ModelRegistry.health_counts() -> (loaded, total) returning both
    values under a single lock so /v1/health cannot observe a TOCTOU split
    between loaded_count() and health_snapshot().
M2: Add code comments near both health handlers documenting the intentional
    design difference (probe vs. operator endpoint).
M3: app.py startup now calls registry.loaded_count() for set_active_recipes
    instead of len(loaded_entries) to avoid a count desync.
M5: Consolidate the banner warn_loop: capture flags into local variables so
    only one asyncio task fires per interval even when both --insecure-no-auth
    and --dev-allow-unsigned are active.
M6: Guard best_score in recipe_detail against NaN/Inf by converting non-finite
    floats to None before returning (matches RecommendItem.score posture).
M7: Add sidecar_unsupported sentinel to _RecipeWatchState so TypeError in
    _check_sidecar_changed only warns once per recipe rather than every poll.
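
The point of M1 above is that both counts are read under one lock acquisition, so a concurrent update cannot land between them. A simplified stand-in for the registry (the class name and its internal dict of loaded flags are assumptions) could look like this:

```python
import threading


class TinyRegistry:
    """Simplified stand-in for a model registry; not the real ModelRegistry."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}  # recipe name -> loaded flag

    def set_entry(self, name, loaded):
        with self._lock:
            self._entries[name] = loaded

    def health_counts(self):
        """Return (loaded, total) from a single consistent snapshot."""
        with self._lock:
            loaded = sum(1 for is_loaded in self._entries.values() if is_loaded)
            return loaded, len(self._entries)


registry = TinyRegistry()
registry.set_entry("healthy", True)
registry.set_entry("stub", False)
print(registry.health_counts())
```

Calling two separate accessors, each taking the lock on its own, would allow a hot-swap to change the registry between the calls and yield a pair that never existed at any single moment.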
Apply review fixes across critical, major, and minor categories from the
v1 API overhaul review.

Critical:
- Restore response-side degradation signal: emit X-Recotem-Items-Degraded
  and recotem_v1_metadata_degraded_items_total with fallback/dropped path
  in _build_items (single recommend verbs only).
- Split recotem_metadata_lookup_errors_total into
  recotem_metadata_index_build_errors_total{recipe} (load-time) and
  recotem_metadata_serialization_errors_total{recipe,verb} (request-time)
  to disambiguate on-call routing.
- Re-evaluate sidecar after recipe YAML mtime change instead of permanent
  skip on TypeError or repeated transient OSError.

Major:
- Populate ModelEntry.algorithms from header tuning.tried_algorithms when
  the top-level field is absent.
- Normalize config_digest at the ModelEntry boundary so the sha256: prefix
  matches the Sha256Hex schema regardless of writer convention.
- Split _resolve_entry's recipe_unavailable warning into recipe_not_found
  (404) and recipe_not_loaded (503) for distinct alert routing.
- Drop route-level logger.exception in verb handlers; rely on the global
  _unhandled_exception_handler for a single unhandled_500 emission.
- Reset _post_hmac_failure_streak in the generic-Exception branch of
  watcher._load_recipe so the streak only counts deserialize failures.

Minor:
- Bind recipe contextvar in recipe_detail / list_recipes.
- Use reason="unexpected" for non-ArtifactError reads in
  _record_load_failure.
- Replace NUL-byte kid sentinel with "<extract_failed>".
- Include user_id_hash / seed_items_count in 500 logs for debugging.
- Add recotem_v1_validation_errors_outside_verb_total for non-verb 422s.
- Default _classify_artifact_error to "unexpected" with a WARN log.
- Log security.posture even when validate_insecure_flags raises.
- Add signing_key_status="construction_failed" for keyring build errors.
- Suppress sidecar reload storms on non-ENOENT OSError after 3 strikes.
- Centralize stub-name dedup in serving/_naming.py.
- Centralize header field extraction in serving/_header_utils.py.

Browser interop / observability:
- Add CORS expose_headers for X-Request-ID, X-Recotem-Model-Version, and
  X-Recotem-Items-Degraded so JS clients can read them cross-origin.
- Whitelist kind label values in inc_metadata_degraded_items to bound
  Prometheus label cardinality.
- Update recotem_recommender_layout_unexpected_total HELP text to mention
  both user_id_to_index and item_id_to_index probes.

Tests:
- Cover the new degraded fallback / dropped paths via monkeypatched
  RecommendItem.model_validate.
- Add include_metadata and exclude_items coverage for
  :batch-recommend-related (mirroring :batch-recommend).
- Assert outer-except partial-failure behavior for batch verbs.
- Assert sidecar_unsupported clears on YAML mtime change.
- Assert _post_hmac_failure_streak resets after generic exceptions.
- Assert 404 / 503 log events use the new recipe_not_found and
  recipe_not_loaded names; assert exactly one unhandled_500 per failure.
- Add Sha256Hex round-trip for normalize_config_digest with a 64-hex
  sample.
- Rename stale test names from the old recipe_unavailable wording.

Docs:
- Remove CHANGELOG.md and docs/migration-v1.md (alpha was internal-only;
  no public migration path needed).
- Update docs/api-reference.md, docs/operations.md, docs/security.md, and
  signing.py comments to drop CHANGELOG references and document the new
  X-Recotem-Items-Degraded header.
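
The `config_digest` normalization mentioned under "Major" could be approximated as below: accept a digest with or without the `sha256:` prefix and always emit the prefixed form. The function name `normalize_config_digest` appears in this PR's test notes, but its behaviour here (lowercasing, whitespace stripping, `None` passthrough for stub entries, `ValueError` on bad input) is assumed for illustration.

```python
import re

_HEX64 = re.compile(r"[0-9a-f]{64}")
_PREFIX = "sha256:"


def normalize_config_digest(value):
    """Return 'sha256:<64 hex>' whether or not the writer added the prefix."""
    if value is None:
        return None  # assumption: stub entries carry no digest
    raw = value.strip().lower()
    if raw.startswith(_PREFIX):
        raw = raw[len(_PREFIX):]
    if not _HEX64.fullmatch(raw):
        raise ValueError("config_digest must be 64 hex characters")
    return _PREFIX + raw


sample_hex = "ab" * 32  # 64 hex characters
print(normalize_config_digest(sample_hex))
print(normalize_config_digest("sha256:" + sample_hex) == normalize_config_digest(sample_hex))
```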
@marevol marevol added this to the 2.0.0a1 milestone May 22, 2026
@marevol marevol merged commit b4d6b95 into main May 22, 2026
11 checks passed