feat: multi-provider refs with weighted traffic splitting#213
feat: multi-provider refs with weighted traffic splitting#213yossiovadia wants to merge 13 commits into
Conversation
Implements inference.opendatahub.io/v1alpha1 API group with two CRDs per Nir's ExternalProvider proposal: - ExternalProvider: provider endpoint, credentials, and provider-specific config for per-provider settings (Vertex AI project/location/etc.) - ExternalModel: client-facing model name with provider ref, targetModel, apiFormat. MaxItems=1 on externalProviderRefs for Phase 1. Config field uses map[string]string rather than map[string]any: - Typed validation in the CRD schema (no x-kubernetes-preserve-unknown-fields) - Clean JSON round-trip (map[string]any converts numbers to float64) - Covers all known use cases (project, location, endpoint are strings) - Can be promoted to a typed struct if nested config is needed later Adds Makefile targets: generate, manifests, verify-codegen.
- NameReference: add RFC 1123 pattern (lowercase alphanumeric + hyphens) - Endpoint: require at least one dot to enforce FQDN (reject single-label) - Bump controller-gen v0.16.4 → v0.20.1 (aligns with controller-runtime v0.23.3) - Add regex validation tests for both patterns
Address Nir's review: - apiFormat: changed from optional to required per API evolution rule (required → optional is non-breaking, opposite is breaking) - Phase status: clarified that Pending/Ready/Failed reflect reconciliation state (missing provider, missing secret), not runtime request health
…binary
Adds the control-plane controller that watches the new CRDs and creates
Istio networking resources:
ExternalProvider reconciler:
- Creates shared Service, ServiceEntry, DestinationRule per provider
- Validates referenced Secret exists (Phase=Failed/SecretNotFound if missing)
- Idempotent writes, OwnerReference GC, self-healing via Owns(&Service{})
ExternalModel reconciler:
- Creates HTTPRoute per model, backend ref to provider's shared Service
- Cross-watches ExternalProviders — endpoint changes propagate automatically
- Self-healing via Owns(&HTTPRoute{})
Controller binary (cmd/controller):
- --gateway-name/--gateway-namespace flags (same pattern as MaaS controller)
- Health probes, leader election, SecurityContext hardening
Tests: 9 unit + 6 envtest (externalprovider), 3 unit + 5 envtest (externalmodel)
…-first, configurable timeout 1. Istio self-healing: add Watches() for ServiceEntry and DestinationRule with label predicate + EnqueueRequestForOwner, so external deletion triggers re-reconciliation (same as Service via Owns) 2. ExternalModel checks provider Phase: if provider is not Ready, model sets Phase=Failed instead of creating an HTTPRoute to an unconfigured endpoint 3. Validate-before-create: ExternalProvider reconciler validates secret exists before creating networking resources. Failed validation = no resources created, Phase=Failed 4. Configurable HTTPRoute timeout: --default-route-timeout flag (default 300s) on controller binary, passed through to buildHTTPRoute 5. Leader election comment in sample deployment YAML
Replace maas.opendatahub.io ExternalModel watcher with two inference.opendatahub.io watchers: - ExternalProvider reconciler → providerInfoStore (endpoint, creds, config) - ExternalModel reconciler → resolves provider ref, writes to modelInfoStore Provider config (map[string]string) flows through to CycleState via ProviderConfigKey for downstream plugins (api-translation). ProcessRequest unchanged — same CycleState keys, same downstream behavior. Breaking change: BBR no longer resolves maas.opendatahub.io ExternalModel CRs. RBAC updated to inference.opendatahub.io in Helm templates.
1. Cross-watch: ExternalModel plugin reconciler now Watches() ExternalProvider changes via mapProviderToModels. When a provider is updated (e.g., credential rotation), all referencing models are re-reconciled so modelInfoStore stays current. Without this, stale credentials persist in the data plane until the ExternalModel CR is touched. 2. Propagation test: verifies that updating provider credentials in providerStore and re-reconciling the model picks up the new values in modelStore.
- Model reconciler: delete stale modelStore entry when externalProviderRefs is empty, malformed, or missing provider ref name - Provider reconciler: delete stale providerStore entry when required fields become empty (valid→invalid spec transition) - Log List errors in mapProviderToModels instead of silently returning nil
The api-translation plugin now prefers the apiFormat field (from ExternalModel.spec.externalProviderRefs[].apiFormat) over the provider type (from ExternalProvider.spec.provider) when selecting the request/ response translator. This enables explicit API format control — e.g., a Bedrock provider can serve both OpenAI-compatible and Anthropic-format models by specifying different apiFormat values per model. Flow: ExternalModel CR → plugin reconciler extracts apiFormat → CycleState (APIFormatKey) → api-translation resolves translator. Falls back to ProviderKey if APIFormatKey is not set.
Enables an ExternalModel to reference multiple ExternalProviders with
relative weights for traffic splitting.
CRD:
- Remove MaxItems=1 on externalProviderRefs
- Add Weight field (*int32, default=1, minimum=0)
Controller (ExternalModel reconciler):
- Resolves all provider refs, creates HTTPRoute with per-provider rules
- Rule 1: path-based match (Kuadrant compatibility)
- Per-provider rules: X-Selected-Provider header match with provider-
specific backend ref and Host header for TLS SNI
BBR plugin (model-provider-resolver):
- Weighted random selection across provider refs (precomputed totalWeight)
- Sets X-Selected-Provider header for deterministic Envoy routing
- Rewrites body model field to selected provider's targetModel
- Model name validation: checks CR name (client-facing) not targetModel
Example:
externalProviderRefs:
- ref: {name: openai-provider}
targetModel: gpt-4o
apiFormat: openai
weight: 80
- ref: {name: bedrock-provider}
targetModel: gpt-4o-bedrock
apiFormat: bedrock-openai
weight: 20
E2E verified: 14/6 OpenAI/Anthropic distribution over 20 requests
(configured 80/20).
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: yossiovadia The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
📝 WalkthroughWalkthroughThis pull request introduces a complete Kubernetes controller infrastructure for managing external LLM providers and models. It adds two new CRD types ( Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Security & Code Quality IssuesCWE-434 / Unrestricted Endpoint Configuration in ExternalProvider The CWE-639 / Authorization Bypass via Broad RBAC The deployment manifest ( CWE-269 / Improper Authentication in Weight-Based Provider Selection The CWE-347 / Insufficient Verification of Data Authenticity in Secret Reference Validation The CWE-400 / Uncontrolled Resource Consumption via HTTPRoute Multiplicity For each CWE-326 / Weak Cryptography / Unencrypted Data Transport The CWE-95 / Improper Neutralization via HTTPRoute Header Rewriting The Additional Observations
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches⚔️ Resolve merge conflicts
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Review rate limit: 0/1 reviews remaining, refill in 60 minutes.Comment |
There was a problem hiding this comment.
Actionable comments posted: 16
🧹 Nitpick comments (5)
pkg/plugins/model-provider-resolver/provider_store_test.go (1)
75-93: Lines 75-93: concurrent test has no correctness assertion.This test currently passes on “no panic” only. Add post-
Wait()invariants (e.g., any found entry must be non-nil and have non-emptyprovider) so regressions are detectable.Suggested patch
func TestProviderStore_ConcurrentAccess(t *testing.T) { @@ wg.Wait() + + for i := 0; i < 10; i++ { + key := types.NamespacedName{Namespace: "ns", Name: fmt.Sprintf("prov-%d", i)} + if info, found := store.get(key); found { + require.NotNil(t, info) + assert.NotEmpty(t, info.provider) + } + } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/plugins/model-provider-resolver/provider_store_test.go` around lines 75 - 93, Update TestProviderStore_ConcurrentAccess to assert invariants after wg.Wait: call newProviderInfoStore via existing variable `store` and for each possible key used in the goroutines (e.g., NamespacedName with Namespace "ns" and Name fmt.Sprintf("prov-%d", i) for i in 0..9) call `store.get(key)` and add assertions that if the returned pointer is non-nil then its `provider` field is not empty; use testing helpers (t.Fatalf/t.Errorf) to fail the test on violation. Ensure you reference the existing methods `addOrUpdate`, `get`, `delete` and the `providerInfo` struct when adding these checks so regressions become detectable.deploy/controller/samples/external-provider-katan.yaml (1)
1-24: Add comment clarifying these are sample/test resources.The endpoint
3-13-21-181.sslip.io(resolves to3.13.21.181) is clearly not a real provider endpoint. Consider adding a header comment indicating these are for local development/testing to prevent confusion or accidental deployment.📝 Proposed comment
+# Sample ExternalProvider resources for local development/testing. +# Replace endpoint and secretRef with real values for production use. +--- apiVersion: inference.opendatahub.io/v1alpha1 kind: ExternalProvider🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deploy/controller/samples/external-provider-katan.yaml` around lines 1 - 24, Add a clear header comment at the top of the YAML indicating these resources are sample/test/local-development only and not real production endpoints; update the file containing ExternalProvider resources (look for kind: ExternalProvider and metadata names katan-openai-provider and katan-vertex-provider) to include a short comment block explaining that endpoint 3-13-21-181.sslip.io is a placeholder resolving to 3.13.21.181 and should not be used in production or committed without change, and optionally suggest replacing with real endpoints or secrets for deployment.Makefile (2)
71-77: Quote$(CONTROLLER_GEN)in the recipes.These invocations break when the repo path contains spaces and they violate the Makefile rule for shell variables.
Suggested fix
generate: controller-gen ## Generate deepcopy methods for CRD types. - $(CONTROLLER_GEN) object paths="./api/..." + "$(CONTROLLER_GEN)" object paths="./api/..." manifests: controller-gen ## Generate CRD manifests from Go types. - $(CONTROLLER_GEN) crd paths="./api/..." output:crd:artifacts:config=config/crd/bases + "$(CONTROLLER_GEN)" crd paths="./api/..." output:crd:artifacts:config=config/crd/basesAs per coding guidelines, "Quote shell variables in targets".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 71 - 77, The recipes for the generate and manifests targets use the unquoted shell variable $(CONTROLLER_GEN), which breaks when the repo path contains spaces and violates the Makefile rule to quote shell variables; update the commands in the generate and manifests targets to quote the variable (i.e., use "$(CONTROLLER_GEN)" in both the generate recipe and the manifests recipe) so the command is treated as a single word and works with paths containing spaces.
79-81:verify-codegenneeds to be on the default verify path.Right now this target exists, but
make verifystill skips it. Generated CRDs/deepcopy files can drift while local and CI verification still pass.Follow-up change outside this hunk
-.PHONY: verify -verify: tidy vet fmt lint ## Verify the codebase (tidy, vet, fmt, lint). +.PHONY: verify +verify: tidy vet fmt lint verify-codegen ## Verify the codebase (tidy, vet, fmt, lint, codegen).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@Makefile` around lines 79 - 81, The Makefile defines a verify-codegen target but the main verify target doesn't depend on it, so CI/local `make verify` skips codegen checks; update the Makefile so the verify target depends on verify-codegen (add verify-codegen to the dependency list for the verify target) so that calling `make verify` will run the `verify-codegen` rule (target names: verify, verify-codegen).pkg/controller/externalmodel/reconciler.go (1)
199-218: Replace namespace-wide model scans with a field index on provider refs.On every
ExternalProviderevent,mapProviderToModelslists allExternalModelobjects in the namespace and scans all refs. This degrades quickly with tenant/model count and increases reconcile latency.Proposed refactor
func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { + if err := mgr.GetFieldIndexer().IndexField(context.Background(), + &inferencev1alpha1.ExternalModel{}, + "spec.externalProviderRefs.ref.name", + func(raw client.Object) []string { + m := raw.(*inferencev1alpha1.ExternalModel) + out := make([]string, 0, len(m.Spec.ExternalProviderRefs)) + for _, ref := range m.Spec.ExternalProviderRefs { + out = append(out, ref.Ref.Name) + } + return out + }, + ); err != nil { + return err + } + return ctrl.NewControllerManagedBy(mgr). @@ } func (r *Reconciler) mapProviderToModels(ctx context.Context, obj client.Object) []reconcile.Request { provider := obj.(*inferencev1alpha1.ExternalProvider) modelList := &inferencev1alpha1.ExternalModelList{} - if err := r.List(ctx, modelList, client.InNamespace(provider.Namespace)); err != nil { + if err := r.List(ctx, modelList, + client.InNamespace(provider.Namespace), + client.MatchingFields{"spec.externalProviderRefs.ref.name": provider.Name}, + ); err != nil { return nil }As per coding guidelines "REVIEW PRIORITIES: ... 4. Performance problems".
Also applies to: 221-228
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/externalmodel/reconciler.go` around lines 199 - 218, The current mapProviderToModels implementation does an expensive namespace-wide List and scan; replace it with a field index on ExternalModel external provider names and query via MatchingFields. Add an index in the controller setup (e.g., in SetupWithManager or SetupIndexes) using mgr.GetFieldIndexer().IndexField(ctx, &v1alpha1.ExternalModel{}, "externalProviderRefNames", func(obj client.Object) []string { extract and return each ref.Ref.Name from obj.(*ExternalModel).Spec.ExternalProviderRefs }), then change mapProviderToModels to List only models matching client.InNamespace(provider.Namespace) and client.MatchingFields(map[string]string{"externalProviderRefNames": provider.Name}) (remove the manual loop scanning refs). Repeat the same change for the other similar block referenced (lines ~221-228).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.gitignore:
- Line 3: The .gitignore currently has an unanchored rule "controller" which
ignores any path segment named controller; update that line to anchor it to the
repository root by replacing the rule with "/controller" so only the top-level
binary artifact is ignored (locate the existing "controller" entry and change it
to "/controller").
In `@api/inference/v1alpha1/externalmodel_types.go`:
- Around line 42-47: The ExternalProviderRefs slice on the ExternalModel spec
lacks an upper bound, allowing unbounded expansion of HTTPRoute rules; add a
kubebuilder validation annotation to limit items (e.g., add `//
+kubebuilder:validation:MaxItems=10`) above the ExternalProviderRefs declaration
in externalmodel_types.go so the field `ExternalProviderRefs
[]ExternalProviderRef` is constrained; ensure the annotation uses the same
spelling/casing as the existing `+kubebuilder:validation:Required`/`MinItems`
comments so the CRD generator picks up the MaxItems limit.
In `@api/inference/v1alpha1/types_test.go`:
- Around line 63-83: The TestExternalModelDeepCopy currently only checks
string-field mutation; add verification for pointer-field aliasing by
initializing ExternalProviderRef.Weight (a *int32) on original.Spec (in
TestExternalModelDeepCopy), call original.DeepCopy(), mutate the
copied.Spec.ExternalProviderRefs[0].Weight value (e.g., change the int32 via
pointer deref) and assert that original.Spec.ExternalProviderRefs[0].Weight
still has the original value; ensure you allocate a separate int32 for the
original and copied objects to detect shallow-copy regressions in
ExternalModel.DeepCopy.
In `@config/crd/bases/inference.opendatahub.io_externalmodels.yaml`:
- Around line 17-23: The additionalPrinterColumns currently hardcode
.spec.externalProviderRefs[0] which only shows the first provider; update the
two columns under additionalPrinterColumns to aggregate all entries by replacing
.spec.externalProviderRefs[0].targetModel and
.spec.externalProviderRefs[0].ref.name with jsonPaths that iterate the array
(e.g. .spec.externalProviderRefs[*].targetModel and
.spec.externalProviderRefs[*].ref.name) so kubectl prints all providers/targets,
or remove these two columns entirely if you prefer not to expose aggregated
provider info; adjust the jsonPath entries for additionalPrinterColumns
accordingly.
In `@deploy/controller/deployment.yaml`:
- Around line 27-30: The role currently grants cluster-wide "list" and "watch"
on Secrets (the rule with resources: ["secrets"]), which is excessive; update
the rule for Secret validation used by the controller to only allow the minimal
verb(s) needed (e.g., "get") by removing "list" and "watch" from the verbs array
on that secrets rule in deployment.yaml (or scope to specific resourceNames if a
fixed secret must be accessed), leaving the rest of the role unchanged.
In `@Dockerfile.controller`:
- Line 25: The base image in the Dockerfile is using an unstable ':latest' tag
in the FROM statement (FROM --platform=$TARGETPLATFORM
registry.access.redhat.com/ubi9/ubi-minimal:latest); replace this with a pinned,
immutable reference — either a specific, versioned tag (e.g., ubi9:X.Y.Z) or
preferably an image digest (sha256) — so the controller runtime base image is
fixed and repeatable; update the FROM line to use that exact tag or digest and
record the chosen reference in your image release notes.
In `@docs/plans/2026-04-27-split-external-provider-prs.md`:
- Around line 42-47: Update the fenced code block containing the tree diagram
(the block that currently begins with ```) to declare a language identifier by
changing the opening fence to ```text so the diagram is treated as text and
satisfies MD040; leave the block content unchanged except for adding the
language token.
In `@pkg/controller/externalmodel/reconciler.go`:
- Around line 96-99: The reconcileHTTPRoute call currently converts
dependency/spec validation failures into a returned error causing retries;
instead, detect specific validation errors (missing refs, provider not found,
provider not-ready) inside reconciler.go (in reconcileHTTPRoute and the
subsequent checks around lines where reconcileHTTPRoute is invoked and status is
set) and handle them as non-fatal: call r.setStatus(ctx, logger, model,
"Failed", metav1.ConditionFalse, "<SpecificReason>", err.Error()) and return
ctrl.Result{RequeueAfter: <optionalDuration>} with nil error (or ctrl.Result{}
with nil error for non-transient cases) rather than returning err. Update all
similar blocks (including the region flagged around 106-123) so
provider/dependency readiness errors do not bubble as reconcile failures but
only update status and control requeue behavior. Ensure specific reason strings
differentiate missing refs vs provider-not-found vs not-ready for observability.
- Around line 158-169: The setStatus function currently always calls
r.Status().Update which causes needless API writes; change setStatus to first
compute the desired phase and desired metav1.Condition (same fields used with
meta.SetStatusCondition), then compare the existing model.Status.Phase and
model.Status.Conditions to the desired values (use a deep equality check such as
equality.Semantic.DeepEqual or reflect.DeepEqual on the Conditions slice) and
return early if nothing changed; only call meta.SetStatusCondition and
r.Status().Update when the phase or conditions differ (ensure ObservedGeneration
is considered in the comparison so updates that only change generation still
trigger an update).
In `@pkg/controller/externalmodel/resources.go`:
- Around line 95-123: The provider-specific HTTPRouteRule created inside the
providers loop in resources.go only matches on the X-Selected-Provider header;
update the Matches for each gatewayapiv1.HTTPRouteRule (the rule built inside
"for _, p := range providers") to include the same PathPrefix HTTPRouteMatch
used by the initial/default rule so provider rules also constrain by request
path; add a second match entry (or augment the existing HTTPRouteMatch) that
uses gatewayapiv1.HTTPRouteMatch.PathPrefix with the identical path
value/variable used by the initial rule to prevent cross-model routing.
In `@pkg/controller/externalprovider/reconciler.go`:
- Around line 130-141: The validateSecretRef method currently only checks Secret
existence; modify Reconciler.validateSecretRef to fetch the Secret into a
corev1.Secret (instead of passing an empty struct), then verify the Secret.Data
(or StringData) contains a non-empty "api-key" entry for
provider.Spec.Auth.SecretRef.Name in provider.Namespace and return a descriptive
error (e.g., "secret %q missing non-empty api-key") when absent or empty; keep
the existing not-found and Get error handling but add this credential
presence/emptiness validation before returning nil.
- Line 49: The RBAC marker granting Secrets access is too permissive; update the
kubebuilder RBAC comment that currently includes "verbs=get;list;watch" to only
include "verbs=get" for secrets. Locate the RBAC line near the top of the
reconciler (the marker above the reconciler type) and change the verbs to "get"
only—this aligns with the reconciler's validateSecretRef method which only calls
r.Get(ctx, key, &corev1.Secret{}) and avoids adding unnecessary list/watch
permissions.
In `@pkg/plugins/model-provider-resolver/external_model_reconciler.go`:
- Around line 81-85: When providerStore.get(providerKey) returns not found,
delete the cached externalModelInfo for this model from modelStore before
returning RequeueAfter so we fail closed; locate the branch around
providerStore.get(providerKey) / logger.Info(...) and add a call to remove the
model's cached entry from modelStore (e.g., modelStore.delete(...) or
modelStore.remove(...) for the same key used to store externalModelInfo)
immediately before returning ctrl.Result{RequeueAfter: 2 * time.Second}, nil so
ProcessRequest no longer routes using stale provider data.
In `@pkg/plugins/model-provider-resolver/plugin_test.go`:
- Around line 179-187: The test's externalModelInfo instance never sets
totalWeight so selectProvider falls back to refs[0] and never exercises weighted
selection; set externalModelInfo.totalWeight to the sum of the providerRef
weights (e.g., 100 + 0 = 100) when constructing the test data so selectProvider
runs the weighted logic. Update the test's externalModelInfo (the struct used
with selectProvider and providerRef entries) to initialize totalWeight
appropriately to reflect the weights in the refs.
In `@pkg/plugins/model-provider-resolver/plugin.go`:
- Around line 92-109: The loop only inspects refs[0] so models that reference
the provider at index >0 won't be requeued; update the loop over modelList.Items
to iterate all entries in the extracted refs slice (for example: for j := range
refs { ... }) and for each entry perform the map[string]any type assertion, call
nestedString(refMap, "ref", "name") and compare to providerName, and if it
matches append the reconcile.Request (same NamespacedName logic) to requests;
ensure you continue on non-map entries and do not assume refs[0] exists.
In `@pkg/plugins/model-provider-resolver/provider_store.go`:
- Around line 44-60: The store currently saves caller-owned *providerInfo
pointers in addOrUpdate and returns internal pointers from get, allowing callers
to mutate providerInfo (notably the config map) outside s.lock and cause races;
fix by making and storing a defensive deep copy of providerInfo (including a
cloned config map) inside addOrUpdate while holding s.lock, and have get return
a copy (not the internal pointer) created while holding s.lock (implement a
helper like cloneProviderInfo to copy fields and maps so no caller-owned pointer
escapes the lock on read or write).
---
Nitpick comments:
In `@deploy/controller/samples/external-provider-katan.yaml`:
- Around line 1-24: Add a clear header comment at the top of the YAML indicating
these resources are sample/test/local-development only and not real production
endpoints; update the file containing ExternalProvider resources (look for kind:
ExternalProvider and metadata names katan-openai-provider and
katan-vertex-provider) to include a short comment block explaining that endpoint
3-13-21-181.sslip.io is a placeholder resolving to 3.13.21.181 and should not be
used in production or committed without change, and optionally suggest replacing
with real endpoints or secrets for deployment.
In `@Makefile`:
- Around line 71-77: The recipes for the generate and manifests targets use the
unquoted shell variable $(CONTROLLER_GEN), which breaks when the repo path
contains spaces and violates the Makefile rule to quote shell variables; update
the commands in the generate and manifests targets to quote the variable (i.e.,
use "$(CONTROLLER_GEN)" in both the generate recipe and the manifests recipe) so
the command is treated as a single word and works with paths containing spaces.
- Around line 79-81: The Makefile defines a verify-codegen target but the main
verify target doesn't depend on it, so CI/local `make verify` skips codegen
checks; update the Makefile so the verify target depends on verify-codegen (add
verify-codegen to the dependency list for the verify target) so that calling
`make verify` will run the `verify-codegen` rule (target names: verify,
verify-codegen).
In `@pkg/controller/externalmodel/reconciler.go`:
- Around line 199-218: The current mapProviderToModels implementation does an
expensive namespace-wide List and scan; replace it with a field index on
ExternalModel external provider names and query via MatchingFields. Add an index
in the controller setup (e.g., in SetupWithManager or SetupIndexes) using
mgr.GetFieldIndexer().IndexField(ctx, &v1alpha1.ExternalModel{},
"externalProviderRefNames", func(obj client.Object) []string { extract and
return each ref.Ref.Name from obj.(*ExternalModel).Spec.ExternalProviderRefs }),
then change mapProviderToModels to List only models matching
client.InNamespace(provider.Namespace) and
client.MatchingFields(map[string]string{"externalProviderRefNames":
provider.Name}) (remove the manual loop scanning refs). Repeat the same change
for the other similar block referenced (lines ~221-228).
In `@pkg/plugins/model-provider-resolver/provider_store_test.go`:
- Around line 75-93: Update TestProviderStore_ConcurrentAccess to assert
invariants after wg.Wait: call newProviderInfoStore via existing variable
`store` and for each possible key used in the goroutines (e.g., NamespacedName
with Namespace "ns" and Name fmt.Sprintf("prov-%d", i) for i in 0..9) call
`store.get(key)` and add assertions that if the returned pointer is non-nil then
its `provider` field is not empty; use testing helpers (t.Fatalf/t.Errorf) to
fail the test on violation. Ensure you reference the existing methods
`addOrUpdate`, `get`, `delete` and the `providerInfo` struct when adding these
checks so regressions become detectable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Central YAML (base), Organization UI (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 14631011-00f1-4ee0-bdb8-be64e66391bd
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum,!**/*.sum
📒 Files selected for processing (42)
.gitignoreDockerfile.controllerMakefileapi/inference/v1alpha1/externalmodel_types.goapi/inference/v1alpha1/externalprovider_types.goapi/inference/v1alpha1/groupversion_info.goapi/inference/v1alpha1/types_test.goapi/inference/v1alpha1/zz_generated.deepcopy.gocmd/controller/main.goconfig/crd/bases/inference.opendatahub.io_externalmodels.yamlconfig/crd/bases/inference.opendatahub.io_externalproviders.yamldeploy/controller/deployment.yamldeploy/controller/samples/external-model-katan.yamldeploy/controller/samples/external-provider-katan.yamldeploy/payload-processing/templates/rbac.yamldocs/plans/2026-04-27-split-external-provider-prs.mdgo.modpkg/controller/common/constants.gopkg/controller/externalmodel/reconciler.gopkg/controller/externalmodel/reconciler_test.gopkg/controller/externalmodel/resources.gopkg/controller/externalmodel/resources_test.gopkg/controller/externalmodel/testdata/gateway-api-crds/httproute-crd.yamlpkg/controller/externalprovider/reconciler.gopkg/controller/externalprovider/reconciler_test.gopkg/controller/externalprovider/resources.gopkg/controller/externalprovider/resources_test.gopkg/controller/externalprovider/testdata/istio-crds/destinationrule-crd.yamlpkg/controller/externalprovider/testdata/istio-crds/serviceentry-crd.yamlpkg/plugins/api-translation/plugin.gopkg/plugins/api-translation/plugin_test.gopkg/plugins/common/state/state-keys.gopkg/plugins/model-provider-resolver/external_model_reconciler.gopkg/plugins/model-provider-resolver/external_model_reconciler_test.gopkg/plugins/model-provider-resolver/external_provider_reconciler.gopkg/plugins/model-provider-resolver/external_provider_reconciler_test.gopkg/plugins/model-provider-resolver/model_store.gopkg/plugins/model-provider-resolver/model_store_test.gopkg/plugins/model-provider-resolver/plugin.gopkg/plugins/model-provider-resolver/plugin_test.gopkg/plugins/model-provider-resolver/provider_store.gopkg/plugins/model-provider-resolver/provider_store_test.go
| @@ -1,2 +1,3 @@ | |||
| # Binaries for programs and plugins | |||
| bin/* | |||
| controller | |||
There was a problem hiding this comment.
Line 3: anchor the ignore rule to repo root.
controller matches any path segment named controller; use /controller to target only the root binary artifact.
Suggested patch
-controller
+/controller📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| controller | |
| /controller |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.gitignore at line 3, The .gitignore currently has an unanchored rule
"controller" which ignores any path segment named controller; update that line
to anchor it to the repository root by replacing the rule with "/controller" so
only the top-level binary artifact is ignored (locate the existing "controller"
entry and change it to "/controller").
| // ExternalProviderRefs maps this model to one or more external providers. | ||
| // Each entry specifies the provider, provider-side model name, API format, | ||
| // and an optional weight for traffic splitting. | ||
| // +kubebuilder:validation:Required | ||
| // +kubebuilder:validation:MinItems=1 | ||
| ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"` |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
Does Kubebuilder/controller-tools support +kubebuilder:validation:MaxItems on slice fields in CRD schemas, and is it enforced by generated CRDs?
💡 Result:
Yes, Kubebuilder/controller-tools supports +kubebuilder:validation:MaxItems on slice fields in CRD schemas, and it is enforced by the generated CRDs. The marker generates the corresponding maxItems property in the OpenAPI v3 validation schema of the CRD, which Kubernetes apiserver enforces during create/update operations.
Citations:
- 1: https://book.kubebuilder.io/reference/markers/crd-validation.html
- 2: https://github.com/kubernetes-sigs/controller-tools/blob/master/pkg/crd/markers/validation.go
- 3: https://book.kubebuilder.io/reference/generating-crd.html?highlight=print
- 4: https://book.kubebuilder.io/reference/generating-crd.html
- 5: https://book-v1.book.kubebuilder.io/beyond_basics/generating_crd.html
- 6: https://book-v3.book.kubebuilder.io/reference/markers/crd-validation
🏁 Script executed:
# Check the ExternalModel type and ExternalProviderRef struct
fd -t f 'externalmodel_types.go' --exec cat -n {} \;Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 5319
🏁 Script executed:
# Check how ExternalProviderRefs is used in the controller
fd -t f 'resources.go' api/inference --exec grep -n 'ExternalProviderRefs' {} +Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 70
🏁 Script executed:
# Search for HTTPRoute generation logic in controller code
fd -t f -e go 'externalmodel' pkg/controller --exec grep -l 'HTTPRoute' {} \;Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 70
🏁 Script executed:
# Check for other unbounded slices in the same types file
fd -t f 'externalmodel_types.go' --exec rg '\[\]' {} -n \;Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 282
🏁 Script executed:
# Find controller directories and files for externalmodel
fd -t f -e go pkg/controller | head -20Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 450
🏁 Script executed:
# Search for where ExternalProviderRef is used in the codebase
rg 'ExternalProviderRef' -t go -lRepository: opendatahub-io/ai-gateway-payload-processing
Length of output: 219
🏁 Script executed:
# Look for any controller that processes ExternalModel spec
rg 'ExternalModel' pkg/controller -t go -l | head -5Repository: opendatahub-io/ai-gateway-payload-processing
Length of output: 268
🏁 Script executed:
# Examine the resources.go file to see how ExternalProviderRefs is processed
wc -l pkg/controller/externalmodel/resources.goRepository: opendatahub-io/ai-gateway-payload-processing
Length of output: 134
🏁 Script executed:
# Get the content of resources.go
cat -n pkg/controller/externalmodel/resources.goRepository: opendatahub-io/ai-gateway-payload-processing
Length of output: 5591
Add MaxItems to externalProviderRefs to prevent unbounded HTTPRoute expansion.
Each provider ref expands into an HTTPRoute rule (see pkg/controller/externalmodel/resources.go line 96-124); without an upper bound, a single CR can force the controller to generate an arbitrarily large rule set, causing resource exhaustion (CWE-770). Per coding guidelines (**/api/**/*_types.go), add MaxItems validation.
Suggested fix
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinItems=1
+ // +kubebuilder:validation:MaxItems=16
ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"`📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // ExternalProviderRefs maps this model to one or more external providers. | |
| // Each entry specifies the provider, provider-side model name, API format, | |
| // and an optional weight for traffic splitting. | |
| // +kubebuilder:validation:Required | |
| // +kubebuilder:validation:MinItems=1 | |
| ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"` | |
| // ExternalProviderRefs maps this model to one or more external providers. | |
| // Each entry specifies the provider, provider-side model name, API format, | |
| // and an optional weight for traffic splitting. | |
| // +kubebuilder:validation:Required | |
| // +kubebuilder:validation:MinItems=1 | |
| // +kubebuilder:validation:MaxItems=16 | |
| ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"` |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/inference/v1alpha1/externalmodel_types.go` around lines 42 - 47, The
ExternalProviderRefs slice on the ExternalModel spec lacks an upper bound,
allowing unbounded expansion of HTTPRoute rules; add a kubebuilder validation
annotation to limit items (e.g., add `// +kubebuilder:validation:MaxItems=10`)
above the ExternalProviderRefs declaration in externalmodel_types.go so the
field `ExternalProviderRefs []ExternalProviderRef` is constrained; ensure the
annotation uses the same spelling/casing as the existing
`+kubebuilder:validation:Required`/`MinItems` comments so the CRD generator
picks up the MaxItems limit.
| func TestExternalModelDeepCopy(t *testing.T) { | ||
| original := &ExternalModel{ | ||
| Spec: ExternalModelSpec{ | ||
| ExternalProviderRefs: []ExternalProviderRef{ | ||
| { | ||
| Ref: NameReference{Name: "my-openai"}, | ||
| TargetModel: "gpt-4o", | ||
| APIFormat: "openai-chat", | ||
| }, | ||
| }, | ||
| }, | ||
| } | ||
|
|
||
| copied := original.DeepCopy() | ||
|
|
||
| assert.Equal(t, original.Spec, copied.Spec) | ||
|
|
||
| // Verify deep copy — mutating the copy must not affect the original | ||
| copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5" | ||
| assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel) | ||
| } |
There was a problem hiding this comment.
DeepCopy test misses pointer-field aliasing on Weight (Line [63]).
ExternalProviderRef.Weight is a *int32, but this test only validates string field mutation. Add a pointer mutation assertion to catch shallow-copy regressions.
Proposed test hardening
func TestExternalModelDeepCopy(t *testing.T) {
+ w := int32(7)
original := &ExternalModel{
Spec: ExternalModelSpec{
ExternalProviderRefs: []ExternalProviderRef{
{
Ref: NameReference{Name: "my-openai"},
TargetModel: "gpt-4o",
APIFormat: "openai-chat",
+ Weight: &w,
},
},
},
}
@@
copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5"
assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel)
+
+ *copied.Spec.ExternalProviderRefs[0].Weight = 99
+ require.NotNil(t, original.Spec.ExternalProviderRefs[0].Weight)
+ assert.Equal(t, int32(7), *original.Spec.ExternalProviderRefs[0].Weight)
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func TestExternalModelDeepCopy(t *testing.T) { | |
| original := &ExternalModel{ | |
| Spec: ExternalModelSpec{ | |
| ExternalProviderRefs: []ExternalProviderRef{ | |
| { | |
| Ref: NameReference{Name: "my-openai"}, | |
| TargetModel: "gpt-4o", | |
| APIFormat: "openai-chat", | |
| }, | |
| }, | |
| }, | |
| } | |
| copied := original.DeepCopy() | |
| assert.Equal(t, original.Spec, copied.Spec) | |
| // Verify deep copy — mutating the copy must not affect the original | |
| copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5" | |
| assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel) | |
| } | |
| func TestExternalModelDeepCopy(t *testing.T) { | |
| w := int32(7) | |
| original := &ExternalModel{ | |
| Spec: ExternalModelSpec{ | |
| ExternalProviderRefs: []ExternalProviderRef{ | |
| { | |
| Ref: NameReference{Name: "my-openai"}, | |
| TargetModel: "gpt-4o", | |
| APIFormat: "openai-chat", | |
| Weight: &w, | |
| }, | |
| }, | |
| }, | |
| } | |
| copied := original.DeepCopy() | |
| assert.Equal(t, original.Spec, copied.Spec) | |
| // Verify deep copy — mutating the copy must not affect the original | |
| copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5" | |
| assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel) | |
| *copied.Spec.ExternalProviderRefs[0].Weight = 99 | |
| require.NotNil(t, original.Spec.ExternalProviderRefs[0].Weight) | |
| assert.Equal(t, int32(7), *original.Spec.ExternalProviderRefs[0].Weight) | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@api/inference/v1alpha1/types_test.go` around lines 63 - 83, The
TestExternalModelDeepCopy currently only checks string-field mutation; add
verification for pointer-field aliasing by initializing
ExternalProviderRef.Weight (a *int32) on original.Spec (in
TestExternalModelDeepCopy), call original.DeepCopy(), mutate the
copied.Spec.ExternalProviderRefs[0].Weight value (e.g., change the int32 via
pointer deref) and assert that original.Spec.ExternalProviderRefs[0].Weight
still has the original value; ensure you allocate a separate int32 for the
original and copied objects to detect shallow-copy regressions in
ExternalModel.DeepCopy.
| - additionalPrinterColumns: | ||
| - jsonPath: .spec.externalProviderRefs[0].targetModel | ||
| name: TargetModel | ||
| type: string | ||
| - jsonPath: .spec.externalProviderRefs[0].ref.name | ||
| name: Provider | ||
| type: string |
There was a problem hiding this comment.
These printer columns are now misleading for multi-provider models.
Both columns hardcode externalProviderRefs[0], so kubectl get externalmodels reports only the first provider/ref even when the model is split across several backends. Either remove them or replace them with aggregate columns.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@config/crd/bases/inference.opendatahub.io_externalmodels.yaml` around lines
17 - 23, The additionalPrinterColumns currently hardcode
.spec.externalProviderRefs[0] which only shows the first provider; update the
two columns under additionalPrinterColumns to aggregate all entries by replacing
.spec.externalProviderRefs[0].targetModel and
.spec.externalProviderRefs[0].ref.name with jsonPaths that iterate the array
(e.g. .spec.externalProviderRefs[*].targetModel and
.spec.externalProviderRefs[*].ref.name) so kubectl prints all providers/targets,
or remove these two columns entirely if you prefer not to expose aggregated
provider info; adjust the jsonPath entries for additionalPrinterColumns
accordingly.
| # Secrets (validate secretRef) | ||
| - apiGroups: [""] | ||
| resources: ["secrets"] | ||
| verbs: ["get", "list", "watch"] |
There was a problem hiding this comment.
Major: drop cluster-wide secret listing/watching from this role (CWE-250).
The controller only appears to validate a referenced Secret by name. Granting list and watch lets a compromised controller enumerate Secret objects across all namespaces, which is unnecessary privilege for this code path.
Suggested fix
- apiGroups: [""]
resources: ["secrets"]
- verbs: ["get", "list", "watch"]
+ verbs: ["get"]📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Secrets (validate secretRef) | |
| - apiGroups: [""] | |
| resources: ["secrets"] | |
| verbs: ["get", "list", "watch"] | |
| # Secrets (validate secretRef) | |
| - apiGroups: [""] | |
| resources: ["secrets"] | |
| verbs: ["get"] |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@deploy/controller/deployment.yaml` around lines 27 - 30, The role currently
grants cluster-wide "list" and "watch" on Secrets (the rule with resources:
["secrets"]), which is excessive; update the rule for Secret validation used by
the controller to only allow the minimal verb(s) needed (e.g., "get") by
removing "list" and "watch" from the verbs array on that secrets rule in
deployment.yaml (or scope to specific resourceNames if a fixed secret must be
accessed), leaving the rest of the role unchanged.
| func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error { | ||
| key := types.NamespacedName{ | ||
| Name: provider.Spec.Auth.SecretRef.Name, | ||
| Namespace: provider.Namespace, | ||
| } | ||
| if err := r.Get(ctx, key, &corev1.Secret{}); err != nil { | ||
| if apierrors.IsNotFound(err) { | ||
| return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace) | ||
| } | ||
| return fmt.Errorf("failed to get secret %q: %w", key.Name, err) | ||
| } | ||
| return nil |
There was a problem hiding this comment.
Fail reconciliation when the Secret lacks a non-empty api-key.
This only checks that the Secret exists. A Secret without data["api-key"] still lets the provider reach Ready, and downstream auth injection will then fail every request at runtime. Validate the required credential key here instead (CWE-20).
Suggested fix
func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error {
key := types.NamespacedName{
Name: provider.Spec.Auth.SecretRef.Name,
Namespace: provider.Namespace,
}
- if err := r.Get(ctx, key, &corev1.Secret{}); err != nil {
+ secret := &corev1.Secret{}
+ if err := r.Get(ctx, key, secret); err != nil {
if apierrors.IsNotFound(err) {
return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace)
}
return fmt.Errorf("failed to get secret %q: %w", key.Name, err)
}
+ apiKey, ok := secret.Data["api-key"]
+ if !ok || len(apiKey) == 0 {
+ return fmt.Errorf("secret %q in namespace %q must contain non-empty data[\"api-key\"]", key.Name, key.Namespace)
+ }
return nil
}As per coding guidelines, **/*.go: 4. Validate CR spec fields before using in ConfigMaps/Secrets.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error { | |
| key := types.NamespacedName{ | |
| Name: provider.Spec.Auth.SecretRef.Name, | |
| Namespace: provider.Namespace, | |
| } | |
| if err := r.Get(ctx, key, &corev1.Secret{}); err != nil { | |
| if apierrors.IsNotFound(err) { | |
| return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace) | |
| } | |
| return fmt.Errorf("failed to get secret %q: %w", key.Name, err) | |
| } | |
| return nil | |
| func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error { | |
| key := types.NamespacedName{ | |
| Name: provider.Spec.Auth.SecretRef.Name, | |
| Namespace: provider.Namespace, | |
| } | |
| secret := &corev1.Secret{} | |
| if err := r.Get(ctx, key, secret); err != nil { | |
| if apierrors.IsNotFound(err) { | |
| return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace) | |
| } | |
| return fmt.Errorf("failed to get secret %q: %w", key.Name, err) | |
| } | |
| apiKey, ok := secret.Data["api-key"] | |
| if !ok || len(apiKey) == 0 { | |
| return fmt.Errorf("secret %q in namespace %q must contain non-empty data[\"api-key\"]", key.Name, key.Namespace) | |
| } | |
| return nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/controller/externalprovider/reconciler.go` around lines 130 - 141, The
validateSecretRef method currently only checks Secret existence; modify
Reconciler.validateSecretRef to fetch the Secret into a corev1.Secret (instead
of passing an empty struct), then verify the Secret.Data (or StringData)
contains a non-empty "api-key" entry for provider.Spec.Auth.SecretRef.Name in
provider.Namespace and return a descriptive error (e.g., "secret %q missing
non-empty api-key") when absent or empty; keep the existing not-found and Get
error handling but add this credential presence/emptiness validation before
returning nil.
| provInfo, found := r.providerStore.get(providerKey) | ||
| if !found { | ||
| logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName) | ||
| return ctrl.Result{RequeueAfter: 2 * time.Second}, nil | ||
| } |
There was a problem hiding this comment.
Delete the cached model entry before requeueing a missing provider.
If this model was already resolved once, Line 84 returns RequeueAfter but leaves the old externalModelInfo in modelStore. ProcessRequest will keep routing with stale provider data until a later successful reconcile. Fail closed here.
Suggested fix
providerKey := types.NamespacedName{Namespace: req.Namespace, Name: providerRefName}
provInfo, found := r.providerStore.get(providerKey)
if !found {
+ r.modelStore.deleteExternalModel(req.NamespacedName)
logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName)
return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| provInfo, found := r.providerStore.get(providerKey) | |
| if !found { | |
| logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName) | |
| return ctrl.Result{RequeueAfter: 2 * time.Second}, nil | |
| } | |
| providerKey := types.NamespacedName{Namespace: req.Namespace, Name: providerRefName} | |
| provInfo, found := r.providerStore.get(providerKey) | |
| if !found { | |
| r.modelStore.deleteExternalModel(req.NamespacedName) | |
| logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName) | |
| return ctrl.Result{RequeueAfter: 2 * time.Second}, nil | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/plugins/model-provider-resolver/external_model_reconciler.go` around
lines 81 - 85, When providerStore.get(providerKey) returns not found, delete the
cached externalModelInfo for this model from modelStore before returning
RequeueAfter so we fail closed; locate the branch around
providerStore.get(providerKey) / logger.Info(...) and add a call to remove the
model's cached entry from modelStore (e.g., modelStore.delete(...) or
modelStore.remove(...) for the same key used to store externalModelInfo)
immediately before returning ctrl.Result{RequeueAfter: 2 * time.Second}, nil so
ProcessRequest no longer routes using stale provider data.
| &externalModelInfo{ | ||
| modelName: "gpt4", | ||
| refs: []providerRef{ | ||
| {providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o", | ||
| secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100}, | ||
| {providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock", | ||
| secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0}, | ||
| }, | ||
| }, |
There was a problem hiding this comment.
Initialize totalWeight, or this test never exercises weighted selection.
pkg/plugins/model-provider-resolver/model_store.go:42-63 only uses weights when externalModelInfo.totalWeight > 0. Here it stays zero, so selectProvider() falls back to refs[0] and the test passes even if weighted routing is broken.
Suggested fix
&externalModelInfo{
modelName: "gpt4",
+ totalWeight: 100,
refs: []providerRef{
{providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o",
secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100},
{providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock",
secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0},
},
},📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| &externalModelInfo{ | |
| modelName: "gpt4", | |
| refs: []providerRef{ | |
| {providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o", | |
| secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100}, | |
| {providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock", | |
| secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0}, | |
| }, | |
| }, | |
| &externalModelInfo{ | |
| modelName: "gpt4", | |
| totalWeight: 100, | |
| refs: []providerRef{ | |
| {providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o", | |
| secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100}, | |
| {providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock", | |
| secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0}, | |
| }, | |
| }, |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/plugins/model-provider-resolver/plugin_test.go` around lines 179 - 187,
The test's externalModelInfo instance never sets totalWeight so selectProvider
falls back to refs[0] and never exercises weighted selection; set
externalModelInfo.totalWeight to the sum of the providerRef weights (e.g., 100 +
0 = 100) when constructing the test data so selectProvider runs the weighted
logic. Update the test's externalModelInfo (the struct used with selectProvider
and providerRef entries) to initialize totalWeight appropriately to reflect the
weights in the refs.
| for i := range modelList.Items { | ||
| refs, _, _ := unstructured.NestedSlice(modelList.Items[i].Object, "spec", "externalProviderRefs") | ||
| if len(refs) == 0 { | ||
| continue | ||
| } | ||
| refMap, ok := refs[0].(map[string]any) | ||
| if !ok { | ||
| continue | ||
| } | ||
| if nestedString(refMap, "ref", "name") == providerName { | ||
| requests = append(requests, reconcile.Request{ | ||
| NamespacedName: types.NamespacedName{ | ||
| Name: modelList.Items[i].GetName(), | ||
| Namespace: modelList.Items[i].GetNamespace(), | ||
| }, | ||
| }) | ||
| } | ||
| } |
There was a problem hiding this comment.
Scan all externalProviderRefs when mapping provider updates.
Line 97 only checks refs[0]. Any model that references this provider in position 2+ will never be re-enqueued on provider changes, so its cached secret/config stays stale.
Suggested fix
var requests []reconcile.Request
for i := range modelList.Items {
refs, _, _ := unstructured.NestedSlice(modelList.Items[i].Object, "spec", "externalProviderRefs")
if len(refs) == 0 {
continue
}
- refMap, ok := refs[0].(map[string]any)
- if !ok {
- continue
- }
- if nestedString(refMap, "ref", "name") == providerName {
- requests = append(requests, reconcile.Request{
- NamespacedName: types.NamespacedName{
- Name: modelList.Items[i].GetName(),
- Namespace: modelList.Items[i].GetNamespace(),
- },
- })
+ for _, rawRef := range refs {
+ refMap, ok := rawRef.(map[string]any)
+ if !ok {
+ continue
+ }
+ if nestedString(refMap, "ref", "name") != providerName {
+ continue
+ }
+ requests = append(requests, reconcile.Request{
+ NamespacedName: types.NamespacedName{
+ Name: modelList.Items[i].GetName(),
+ Namespace: modelList.Items[i].GetNamespace(),
+ },
+ })
+ break
}
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/plugins/model-provider-resolver/plugin.go` around lines 92 - 109, The
loop only inspects refs[0] so models that reference the provider at index >0
won't be requeued; update the loop over modelList.Items to iterate all entries
in the extracted refs slice (for example: for j := range refs { ... }) and for
each entry perform the map[string]any type assertion, call nestedString(refMap,
"ref", "name") and compare to providerName, and if it matches append the
reconcile.Request (same NamespacedName logic) to requests; ensure you continue
on non-map entries and do not assume refs[0] exists.
| func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) { | ||
| s.lock.Lock() | ||
| defer s.lock.Unlock() | ||
| s.providers[key.String()] = info | ||
| } | ||
|
|
||
| func (s *providerInfoStore) delete(key types.NamespacedName) { | ||
| s.lock.Lock() | ||
| defer s.lock.Unlock() | ||
| delete(s.providers, key.String()) | ||
| } | ||
|
|
||
| func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) { | ||
| s.lock.RLock() | ||
| defer s.lock.RUnlock() | ||
| info, ok := s.providers[key.String()] | ||
| return info, ok |
There was a problem hiding this comment.
Lines 44-60: mutable pointers escape the lock boundary (CWE-362).
addOrUpdate stores caller-owned pointers and get returns internal pointers. Any downstream mutation (especially config map writes) can race outside RWMutex protection and corrupt store state.
Suggested patch
import (
+ "maps"
"sync"
"k8s.io/apimachinery/pkg/types"
)
@@
func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) {
s.lock.Lock()
defer s.lock.Unlock()
- s.providers[key.String()] = info
+ s.providers[key.String()] = cloneProviderInfo(info)
}
@@
func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) {
s.lock.RLock()
defer s.lock.RUnlock()
info, ok := s.providers[key.String()]
- return info, ok
+ if !ok {
+ return nil, false
+ }
+ return cloneProviderInfo(info), true
}
+
+func cloneProviderInfo(in *providerInfo) *providerInfo {
+ if in == nil {
+ return nil
+ }
+ out := *in
+ out.config = maps.Clone(in.config)
+ return &out
+}As per coding guidelines, **: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) { | |
| s.lock.Lock() | |
| defer s.lock.Unlock() | |
| s.providers[key.String()] = info | |
| } | |
| func (s *providerInfoStore) delete(key types.NamespacedName) { | |
| s.lock.Lock() | |
| defer s.lock.Unlock() | |
| delete(s.providers, key.String()) | |
| } | |
| func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) { | |
| s.lock.RLock() | |
| defer s.lock.RUnlock() | |
| info, ok := s.providers[key.String()] | |
| return info, ok | |
| import ( | |
| "maps" | |
| "sync" | |
| "k8s.io/apimachinery/pkg/types" | |
| ) | |
| func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) { | |
| s.lock.Lock() | |
| defer s.lock.Unlock() | |
| s.providers[key.String()] = cloneProviderInfo(info) | |
| } | |
| func (s *providerInfoStore) delete(key types.NamespacedName) { | |
| s.lock.Lock() | |
| defer s.lock.Unlock() | |
| delete(s.providers, key.String()) | |
| } | |
| func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) { | |
| s.lock.RLock() | |
| defer s.lock.RUnlock() | |
| info, ok := s.providers[key.String()] | |
| if !ok { | |
| return nil, false | |
| } | |
| return cloneProviderInfo(info), true | |
| } | |
| func cloneProviderInfo(in *providerInfo) *providerInfo { | |
| if in == nil { | |
| return nil | |
| } | |
| out := *in | |
| out.config = maps.Clone(in.config) | |
| return &out | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@pkg/plugins/model-provider-resolver/provider_store.go` around lines 44 - 60,
The store currently saves caller-owned *providerInfo pointers in addOrUpdate and
returns internal pointers from get, allowing callers to mutate providerInfo
(notably the config map) outside s.lock and cause races; fix by making and
storing a defensive deep copy of providerInfo (including a cloned config map)
inside addOrUpdate while holding s.lock, and have get return a copy (not the
internal pointer) created while holding s.lock (implement a helper like
cloneProviderInfo to copy fields and maps so no caller-owned pointer escapes the
lock on read or write).
E2E Validation — Multi-Provider Traffic SplittingTested on SetupapiVersion: inference.opendatahub.io/v1alpha1
kind: ExternalModel
metadata:
name: multi-provider-test
namespace: llm
spec:
externalProviderRefs:
- ref: {name: katan-openai-provider}
targetModel: llm-katan-echo
apiFormat: openai
weight: 80
- ref: {name: katan-anthropic-provider}
targetModel: llm-katan-echo
apiFormat: anthropic
weight: 20HTTPRoute created by controller3 rules automatically generated: Traffic distribution (20 requests)Close to configured 80/20 — variance is expected with small sample size. Provider selection is identified by response ID format: Verified behaviors
Larger sample (50 requests, from E2E test PR #214) |
|
Note on file count: This PR shows 43 changed files because it's stacked on top of unmerged PRs (#182 → #183 → #184 → #212). GitHub shows the full diff from main, not just this PR's delta. The actual changes in this PR are ~14 files (CRD Weight field, multi-provider HTTPRoute builder, weighted provider selection in plugin, tests). Once the dependency chain merges, GitHub will automatically recalculate the diff to show only this PR's changes. |
|
@yossiovadia going to close this PR. |
Summary
Enables an ExternalModel to reference multiple ExternalProviders with relative weights for traffic splitting. BBR performs the weighted provider selection, Envoy routes deterministically by header match.
Depends on #182, #183, #184 (Phase 1) and #212 (apiFormat).
How it works
Key design decisions
Changes
Test plan
Summary by CodeRabbit
Release Notes
New Features
Infrastructure