test: E2E tests for multi-provider traffic splitting#214
Conversation
Implements inference.opendatahub.io/v1alpha1 API group with two CRDs per Nir's ExternalProvider proposal: - ExternalProvider: provider endpoint, credentials, and provider-specific config for per-provider settings (Vertex AI project/location/etc.) - ExternalModel: client-facing model name with provider ref, targetModel, apiFormat. MaxItems=1 on externalProviderRefs for Phase 1. Config field uses map[string]string rather than map[string]any: - Typed validation in the CRD schema (no x-kubernetes-preserve-unknown-fields) - Clean JSON round-trip (map[string]any converts numbers to float64) - Covers all known use cases (project, location, endpoint are strings) - Can be promoted to a typed struct if nested config is needed later Adds Makefile targets: generate, manifests, verify-codegen.
- NameReference: add RFC 1123 pattern (lowercase alphanumeric + hyphens) - Endpoint: require at least one dot to enforce FQDN (reject single-label) - Bump controller-gen v0.16.4 → v0.20.1 (aligns with controller-runtime v0.23.3) - Add regex validation tests for both patterns
Address Nir's review: - apiFormat: changed from optional to required per API evolution rule (required → optional is non-breaking, opposite is breaking) - Phase status: clarified that Pending/Ready/Failed reflect reconciliation state (missing provider, missing secret), not runtime request health
…binary
Adds the control-plane controller that watches the new CRDs and creates
Istio networking resources:
ExternalProvider reconciler:
- Creates shared Service, ServiceEntry, DestinationRule per provider
- Validates referenced Secret exists (Phase=Failed/SecretNotFound if missing)
- Idempotent writes, OwnerReference GC, self-healing via Owns(&Service{})
ExternalModel reconciler:
- Creates HTTPRoute per model, backend ref to provider's shared Service
- Cross-watches ExternalProviders — endpoint changes propagate automatically
- Self-healing via Owns(&HTTPRoute{})
Controller binary (cmd/controller):
- --gateway-name/--gateway-namespace flags (same pattern as MaaS controller)
- Health probes, leader election, SecurityContext hardening
Tests: 9 unit + 6 envtest (externalprovider), 3 unit + 5 envtest (externalmodel)
…-first, configurable timeout 1. Istio self-healing: add Watches() for ServiceEntry and DestinationRule with label predicate + EnqueueRequestForOwner, so external deletion triggers re-reconciliation (same as Service via Owns) 2. ExternalModel checks provider Phase: if provider is not Ready, model sets Phase=Failed instead of creating an HTTPRoute to an unconfigured endpoint 3. Validate-before-create: ExternalProvider reconciler validates secret exists before creating networking resources. Failed validation = no resources created, Phase=Failed 4. Configurable HTTPRoute timeout: --default-route-timeout flag (default 300s) on controller binary, passed through to buildHTTPRoute 5. Leader election comment in sample deployment YAML
Replace maas.opendatahub.io ExternalModel watcher with two inference.opendatahub.io watchers: - ExternalProvider reconciler → providerInfoStore (endpoint, creds, config) - ExternalModel reconciler → resolves provider ref, writes to modelInfoStore Provider config (map[string]string) flows through to CycleState via ProviderConfigKey for downstream plugins (api-translation). ProcessRequest unchanged — same CycleState keys, same downstream behavior. Breaking change: BBR no longer resolves maas.opendatahub.io ExternalModel CRs. RBAC updated to inference.opendatahub.io in Helm templates.
1. Cross-watch: ExternalModel plugin reconciler now Watches() ExternalProvider changes via mapProviderToModels. When a provider is updated (e.g., credential rotation), all referencing models are re-reconciled so modelInfoStore stays current. Without this, stale credentials persist in the data plane until the ExternalModel CR is touched. 2. Propagation test: verifies that updating provider credentials in providerStore and re-reconciling the model picks up the new values in modelStore.
- Model reconciler: delete stale modelStore entry when externalProviderRefs is empty, malformed, or missing provider ref name - Provider reconciler: delete stale providerStore entry when required fields become empty (valid→invalid spec transition) - Log List errors in mapProviderToModels instead of silently returning nil
The api-translation plugin now prefers the apiFormat field (from ExternalModel.spec.externalProviderRefs[].apiFormat) over the provider type (from ExternalProvider.spec.provider) when selecting the request/ response translator. This enables explicit API format control — e.g., a Bedrock provider can serve both OpenAI-compatible and Anthropic-format models by specifying different apiFormat values per model. Flow: ExternalModel CR → plugin reconciler extracts apiFormat → CycleState (APIFormatKey) → api-translation resolves translator. Falls back to ProviderKey if APIFormatKey is not set.
Enables an ExternalModel to reference multiple ExternalProviders with
relative weights for traffic splitting.
CRD:
- Remove MaxItems=1 on externalProviderRefs
- Add Weight field (*int32, default=1, minimum=0)
Controller (ExternalModel reconciler):
- Resolves all provider refs, creates HTTPRoute with per-provider rules
- Rule 1: path-based match (Kuadrant compatibility)
- Per-provider rules: X-Selected-Provider header match with provider-
specific backend ref and Host header for TLS SNI
BBR plugin (model-provider-resolver):
- Weighted random selection across provider refs (precomputed totalWeight)
- Sets X-Selected-Provider header for deterministic Envoy routing
- Rewrites body model field to selected provider's targetModel
- Model name validation: checks CR name (client-facing) not targetModel
Example:
externalProviderRefs:
- ref: {name: openai-provider}
targetModel: gpt-4o
apiFormat: openai
weight: 80
- ref: {name: bedrock-provider}
targetModel: gpt-4o-bedrock
apiFormat: bedrock-openai
weight: 20
E2E verified: 14/6 OpenAI/Anthropic distribution over 20 requests
(configured 80/20).
Adds 3 test cases to the new-CRD E2E suite for multi-provider: - HTTP 200 for multi-provider model - Traffic distribution: verifies both providers receive traffic with correct weight ratio (80/20 OpenAI/Anthropic over 50 requests) - Model field rewrite: verifies body.model is rewritten to targetModel Also adds apiFormat to single-provider test setup (now required per Nir's review on PR opendatahub-io#182). Total: 41 tests (38 single-provider + 3 multi-provider).
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yossiovadia The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Central YAML (base), Organization UI (inherited) Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (44)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 39 minutes and 13 seconds.Comment |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Note on file count: This PR is stacked on top of #213 → #212 → #184 → #183 → #182. GitHub shows the full diff from main, not just this PR's delta. The actual changes in this PR are 3 files (multi-provider test cases + apiFormat fix in e2e-new-crds suite). Once the dependency chain merges, GitHub will automatically recalculate the diff. |
Summary
Adds multi-provider traffic splitting tests to the new-CRD E2E suite. Creates an ExternalModel with 80/20 weighted OpenAI/Anthropic providers and verifies traffic distribution, model field rewrite, and response format.
Depends on #213 (multi-provider weights) and #187 (new-CRD E2E suite).
Test cases
Verified