Skip to content

feat: multi-provider refs with weighted traffic splitting#213

Closed
yossiovadia wants to merge 13 commits into
opendatahub-io:mainfrom
yossiovadia:feat/multi-provider-weights
Closed

feat: multi-provider refs with weighted traffic splitting#213
yossiovadia wants to merge 13 commits into
opendatahub-io:mainfrom
yossiovadia:feat/multi-provider-weights

Conversation

@yossiovadia
Copy link
Copy Markdown
Contributor

@yossiovadia yossiovadia commented Apr 29, 2026

Summary

Enables an ExternalModel to reference multiple ExternalProviders with relative weights for traffic splitting. BBR performs the weighted provider selection, Envoy routes deterministically by header match.

Depends on #182, #183, #184 (Phase 1) and #212 (apiFormat).

How it works

  1. ExternalModel lists multiple provider refs with weights:
    externalProviderRefs:
      - ref: {name: openai-provider}
        targetModel: gpt-4o
        apiFormat: openai
        weight: 80
      - ref: {name: bedrock-provider}
        targetModel: gpt-4o-bedrock
        apiFormat: bedrock-openai
        weight: 20
  2. Controller creates HTTPRoute with per-provider rules matched by `X-Selected-Provider` header
  3. BBR model-provider-resolver does weighted random selection, sets the header, rewrites `body.model` to the selected provider's targetModel
  4. Downstream plugins (api-translation, apikey-injection) work unchanged — they read from CycleState

Key design decisions

  • BBR selects, Envoy routes — ext-proc runs before routing, so BBR must choose the provider. Envoy just matches the `X-Selected-Provider` header deterministically.
  • Model name validation — now checks request `body.model` against ExternalModel CR name (client-facing), not targetModel (provider-side). Per Nir's clarification.
  • Body model rewrite — `body.model` is rewritten to the selected provider's targetModel before translation.
  • Precomputed totalWeight — avoids recalculating on every request in the hot path.

Changes

Area Files What
CRD externalmodel_types.go Remove MaxItems=1, add Weight field
Controller reconciler.go, resources.go Multi-provider HTTPRoute with X-Selected-Provider rules
Plugin model_store.go Redesigned for multi-provider (list of providerRef per model) + weighted selection
Plugin external_model_reconciler.go Resolves all provider refs with weights
Plugin plugin.go Weighted selection + header + model rewrite
State state-keys.go Add SelectedProviderKey

Test plan

  • `TestSelectProvider_WeightedDistribution` — 10K iterations, verifies 80/20 distribution
  • `TestSelectProvider_ZeroWeight` — zero-weight provider never selected
  • `TestProcessRequest_MultiProviderSetsHeader` — X-Selected-Provider header set
  • `TestProcessRequest_ModelNameMismatch` — validates CR name not targetModel
  • `TestModelReconciler_MultiProvider` — reconciler resolves multiple refs with weights
  • `TestReconcile_MultiProviderHTTPRoute` (envtest) — 3-rule HTTPRoute with correct backends
  • E2E on maas-experimental: 14 OpenAI / 6 Anthropic out of 20 requests (configured 80/20)
  • Single-provider models still work unchanged

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for connecting to external LLM providers via new provider resources with configurable endpoints and authentication
    • Added external model resources that map model names to providers with weighted request routing and load balancing
    • Introduced automatic request routing based on provider selection and API format compatibility
    • Added controller deployment for managing provider connections and request distribution
  • Infrastructure

    • Added containerized controller image with Kubernetes deployment manifest
    • Added new API types and custom resource definitions for provider and model management
    • Updated build tooling to generate required API schemas and client code
    • Updated RBAC permissions to manage new provider and model resources

Implements inference.opendatahub.io/v1alpha1 API group with two CRDs
per Nir's ExternalProvider proposal:

- ExternalProvider: provider endpoint, credentials, and provider-specific
  config for per-provider settings (Vertex AI project/location/etc.)
- ExternalModel: client-facing model name with provider ref, targetModel,
  apiFormat. MaxItems=1 on externalProviderRefs for Phase 1.

Config field uses map[string]string rather than map[string]any:
- Typed validation in the CRD schema (no x-kubernetes-preserve-unknown-fields)
- Clean JSON round-trip (map[string]any converts numbers to float64)
- Covers all known use cases (project, location, endpoint are strings)
- Can be promoted to a typed struct if nested config is needed later

Adds Makefile targets: generate, manifests, verify-codegen.
- NameReference: add RFC 1123 pattern (lowercase alphanumeric + hyphens)
- Endpoint: require at least one dot to enforce FQDN (reject single-label)
- Bump controller-gen v0.16.4 → v0.20.1 (aligns with controller-runtime v0.23.3)
- Add regex validation tests for both patterns
Address Nir's review:
- apiFormat: changed from optional to required per API evolution rule
  (required → optional is non-breaking, opposite is breaking)
- Phase status: clarified that Pending/Ready/Failed reflect reconciliation
  state (missing provider, missing secret), not runtime request health
…binary

Adds the control-plane controller that watches the new CRDs and creates
Istio networking resources:

ExternalProvider reconciler:
- Creates shared Service, ServiceEntry, DestinationRule per provider
- Validates referenced Secret exists (Phase=Failed/SecretNotFound if missing)
- Idempotent writes, OwnerReference GC, self-healing via Owns(&Service{})

ExternalModel reconciler:
- Creates HTTPRoute per model, backend ref to provider's shared Service
- Cross-watches ExternalProviders — endpoint changes propagate automatically
- Self-healing via Owns(&HTTPRoute{})

Controller binary (cmd/controller):
- --gateway-name/--gateway-namespace flags (same pattern as MaaS controller)
- Health probes, leader election, SecurityContext hardening

Tests: 9 unit + 6 envtest (externalprovider), 3 unit + 5 envtest (externalmodel)
…-first, configurable timeout

1. Istio self-healing: add Watches() for ServiceEntry and DestinationRule
   with label predicate + EnqueueRequestForOwner, so external deletion
   triggers re-reconciliation (same as Service via Owns)

2. ExternalModel checks provider Phase: if provider is not Ready, model
   sets Phase=Failed instead of creating an HTTPRoute to an unconfigured
   endpoint

3. Validate-before-create: ExternalProvider reconciler validates secret
   exists before creating networking resources. Failed validation =
   no resources created, Phase=Failed

4. Configurable HTTPRoute timeout: --default-route-timeout flag (default
   300s) on controller binary, passed through to buildHTTPRoute

5. Leader election comment in sample deployment YAML
Replace maas.opendatahub.io ExternalModel watcher with two
inference.opendatahub.io watchers:
- ExternalProvider reconciler → providerInfoStore (endpoint, creds, config)
- ExternalModel reconciler → resolves provider ref, writes to modelInfoStore

Provider config (map[string]string) flows through to CycleState via
ProviderConfigKey for downstream plugins (api-translation).

ProcessRequest unchanged — same CycleState keys, same downstream behavior.

Breaking change: BBR no longer resolves maas.opendatahub.io ExternalModel CRs.
RBAC updated to inference.opendatahub.io in Helm templates.
1. Cross-watch: ExternalModel plugin reconciler now Watches() ExternalProvider
   changes via mapProviderToModels. When a provider is updated (e.g., credential
   rotation), all referencing models are re-reconciled so modelInfoStore stays
   current. Without this, stale credentials persist in the data plane until
   the ExternalModel CR is touched.

2. Propagation test: verifies that updating provider credentials in providerStore
   and re-reconciling the model picks up the new values in modelStore.
- Model reconciler: delete stale modelStore entry when externalProviderRefs
  is empty, malformed, or missing provider ref name
- Provider reconciler: delete stale providerStore entry when required fields
  become empty (valid→invalid spec transition)
- Log List errors in mapProviderToModels instead of silently returning nil
The api-translation plugin now prefers the apiFormat field (from
ExternalModel.spec.externalProviderRefs[].apiFormat) over the provider
type (from ExternalProvider.spec.provider) when selecting the request/
response translator.

This enables explicit API format control — e.g., a Bedrock provider
can serve both OpenAI-compatible and Anthropic-format models by
specifying different apiFormat values per model.

Flow: ExternalModel CR → plugin reconciler extracts apiFormat →
CycleState (APIFormatKey) → api-translation resolves translator.
Falls back to ProviderKey if APIFormatKey is not set.
Enables an ExternalModel to reference multiple ExternalProviders with
relative weights for traffic splitting.

CRD:
- Remove MaxItems=1 on externalProviderRefs
- Add Weight field (*int32, default=1, minimum=0)

Controller (ExternalModel reconciler):
- Resolves all provider refs, creates HTTPRoute with per-provider rules
- Rule 1: path-based match (Kuadrant compatibility)
- Per-provider rules: X-Selected-Provider header match with provider-
  specific backend ref and Host header for TLS SNI

BBR plugin (model-provider-resolver):
- Weighted random selection across provider refs (precomputed totalWeight)
- Sets X-Selected-Provider header for deterministic Envoy routing
- Rewrites body model field to selected provider's targetModel
- Model name validation: checks CR name (client-facing) not targetModel

Example:
  externalProviderRefs:
    - ref: {name: openai-provider}
      targetModel: gpt-4o
      apiFormat: openai
      weight: 80
    - ref: {name: bedrock-provider}
      targetModel: gpt-4o-bedrock
      apiFormat: bedrock-openai
      weight: 20

E2E verified: 14/6 OpenAI/Anthropic distribution over 20 requests
(configured 80/20).
@openshift-ci openshift-ci Bot requested a review from nirrozenbaum April 29, 2026 20:20
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yossiovadia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested a review from noyitz April 29, 2026 20:20
@openshift-ci openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Apr 29, 2026
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented Apr 29, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 29, 2026

📝 Walkthrough

Walkthrough

This pull request introduces a complete Kubernetes controller infrastructure for managing external LLM providers and models. It adds two new CRD types (ExternalProvider and ExternalModel) in the inference.opendatahub.io API group with corresponding Go API implementations, implements two reconcilers to manage Kubernetes/Istio networking resources and Gateway API HTTPRoutes, extends the model-provider resolver plugin to support multi-provider model definitions, and includes deployment manifests with RBAC, Dockerized controller binary build, Makefile code generation targets, comprehensive test coverage, and sample resources for testing.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Security & Code Quality Issues

CWE-434 / Unrestricted Endpoint Configuration in ExternalProvider

The ExternalProvider.spec.endpoint is documented as a required FQDN but the provided summaries show no explicit validation constraints (regex patterns, allowed schemes, or length limits) on the endpoint field in the CRD schema. This allows a user with namespace-level create permissions on ExternalProviders to potentially specify arbitrary external hosts, which reconcilers will use to construct networking resources (ServiceEntry, DestinationRule, HTTPRoute header rewrites). While this is somewhat mitigated by the api.externalprovider_types.go field-level documentation, the CRD schema should enforce a validation rule (e.g., pattern: "^[a-z0-9]([a-z0-9-]*[a-z0-9])?(\\.[a-z0-9]([a-z0-9-]*[a-z0-9])?)*$" for DNS names). Action: Add OpenAPI schema validation for spec.endpoint in the CRD to reject invalid or overly broad hostnames; consider restricting to known external domains if the use-case permits.

CWE-639 / Authorization Bypass via Broad RBAC

The deployment manifest (deploy/controller/deployment.yaml) grants the controller's service account cluster-scoped permissions to create/update/delete Istio ServiceEntry and DestinationRule resources without a resource name restriction or label selector. Combined with the above unrestricted endpoint, an attacker with namespace-scoped ExternalProvider create access could indirectly create mesh-wide networking rules. The summary confirms "manage Istio ServiceEntry/DestinationRule resources" without qualification. Action: Verify the reconcilers apply label selectors (common-labels) consistently and update the ClusterRole to include a resourceNames restriction or require ownership via ownerReferences. Ensure the label selector predicate in SetupWithManager is strict.

CWE-269 / Improper Authentication in Weight-Based Provider Selection

The model_store.go selectProvider() method performs weighted random selection across provider references. The summary shows weight defaults to 1 with "non-negative" constraint, but does not document whether weight: 0 is rejected or treated as a valid "never select" signal. Action: Confirm CRD schema validation rejects weight: 0 or clarify semantics; if zero-weight is intentional, document it in code and ensure the weighting logic handles it gracefully.

CWE-347 / Insufficient Verification of Data Authenticity in Secret Reference Validation

The externalmodel/reconciler.go resolves spec.externalProviderRefs by fetching ExternalProvider objects in the same namespace, then extracts credentials from provider.Spec.Auth.SecretRef.name. The summary states reconcilers "validat[e] that the referenced namespace Secret exists" but does not confirm whether the secret's api-key field is validated for non-empty content. A missing or empty api-key would propagate silently to the backing HTTPRoute. Action: Validate that referenced secrets contain a non-empty api-key field; return a Failed condition with reason SecretInvalid if validation fails.

CWE-400 / Uncontrolled Resource Consumption via HTTPRoute Multiplicity

For each ExternalModel, the resources.go builder creates one HTTPRoute with multiple rules—one default plus one header-match per provider. With many providers (unbounded slice spec.externalProviderRefs), this could generate thousands of rules per route. The summaries do not mention a limit validation. Action: Add a CRD schema constraint limiting externalProviderRefs to a reasonable max (e.g., maxItems: 10) and document the scaling implications.

CWE-326 / Weak Cryptography / Unencrypted Data Transport

The externalprovider/resources.go builder creates DestinationRule with TLS mode: SIMPLE (mTLS origination). The summary confirms this, but does not mention tls.sni or certificate validation configuration. External providers may not present valid certificates. Action: Confirm that DestinationRule TLS mode handles certificate validation correctly; if SNI is required, document it in comments and validate provider endpoints are FQDN-compatible.

CWE-95 / Improper Neutralization via HTTPRoute Header Rewriting

The externalmodel/resources.go buildHTTPRoute injects provider endpoints directly into request Host headers via requestHeaderFilter. If an endpoint contains special characters or is not properly URL-encoded, downstream systems may misinterpret it. Additionally, rewriting Host without validating the endpoint format could enable header injection if endpoint validation is weak. Action: Ensure provider endpoint validation in the CRD schema is strict (regex-based FQDN check); add a unit test that confirms header values are safe and do not break route semantics.


Additional Observations

  • Generated code (zz_generated.deepcopy.go) is auto-generated and expected; no manual review needed beyond verifying the generator ran correctly.
  • Test coverage is extensive but concentrate review on reconciler_test.go files to confirm mocked/test Istio resources and schemes are correctly set up and do not leak into production.
  • Deployment manifest hard-codes image bbr-controller:latest; confirm CI/CD produces a tagged image in production.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the primary feature: multi-provider support with weighted traffic splitting, matching the core objective of enabling ExternalModel to reference multiple ExternalProviders with per-ref weights.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
⚔️ Resolve merge conflicts
  • Resolve merge conflict in branch feat/multi-provider-weights

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
Review rate limit: 0/1 reviews remaining, refill in 60 minutes.

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Nitpick comments (5)
pkg/plugins/model-provider-resolver/provider_store_test.go (1)

75-93: Lines 75-93: concurrent test has no correctness assertion.

This test currently passes on “no panic” only. Add post-Wait() invariants (e.g., any found entry must be non-nil and have non-empty provider) so regressions are detectable.

Suggested patch
 func TestProviderStore_ConcurrentAccess(t *testing.T) {
@@
 	wg.Wait()
+
+	for i := 0; i < 10; i++ {
+		key := types.NamespacedName{Namespace: "ns", Name: fmt.Sprintf("prov-%d", i)}
+		if info, found := store.get(key); found {
+			require.NotNil(t, info)
+			assert.NotEmpty(t, info.provider)
+		}
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/plugins/model-provider-resolver/provider_store_test.go` around lines 75 -
93, Update TestProviderStore_ConcurrentAccess to assert invariants after
wg.Wait: call newProviderInfoStore via existing variable `store` and for each
possible key used in the goroutines (e.g., NamespacedName with Namespace "ns"
and Name fmt.Sprintf("prov-%d", i) for i in 0..9) call `store.get(key)` and add
assertions that if the returned pointer is non-nil then its `provider` field is
not empty; use testing helpers (t.Fatalf/t.Errorf) to fail the test on
violation. Ensure you reference the existing methods `addOrUpdate`, `get`,
`delete` and the `providerInfo` struct when adding these checks so regressions
become detectable.
deploy/controller/samples/external-provider-katan.yaml (1)

1-24: Add comment clarifying these are sample/test resources.

The endpoint 3-13-21-181.sslip.io (resolves to 3.13.21.181) is clearly not a real provider endpoint. Consider adding a header comment indicating these are for local development/testing to prevent confusion or accidental deployment.

📝 Proposed comment
+# Sample ExternalProvider resources for local development/testing.
+# Replace endpoint and secretRef with real values for production use.
+---
 apiVersion: inference.opendatahub.io/v1alpha1
 kind: ExternalProvider
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/controller/samples/external-provider-katan.yaml` around lines 1 - 24,
Add a clear header comment at the top of the YAML indicating these resources are
sample/test/local-development only and not real production endpoints; update the
file containing ExternalProvider resources (look for kind: ExternalProvider and
metadata names katan-openai-provider and katan-vertex-provider) to include a
short comment block explaining that endpoint 3-13-21-181.sslip.io is a
placeholder resolving to 3.13.21.181 and should not be used in production or
committed without change, and optionally suggest replacing with real endpoints
or secrets for deployment.
Makefile (2)

71-77: Quote $(CONTROLLER_GEN) in the recipes.

These invocations break when the repo path contains spaces and they violate the Makefile rule for shell variables.

Suggested fix
 generate: controller-gen ## Generate deepcopy methods for CRD types.
-	$(CONTROLLER_GEN) object paths="./api/..."
+	"$(CONTROLLER_GEN)" object paths="./api/..."

 manifests: controller-gen ## Generate CRD manifests from Go types.
-	$(CONTROLLER_GEN) crd paths="./api/..." output:crd:artifacts:config=config/crd/bases
+	"$(CONTROLLER_GEN)" crd paths="./api/..." output:crd:artifacts:config=config/crd/bases

As per coding guidelines, "Quote shell variables in targets".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 71 - 77, The recipes for the generate and manifests
targets use the unquoted shell variable $(CONTROLLER_GEN), which breaks when the
repo path contains spaces and violates the Makefile rule to quote shell
variables; update the commands in the generate and manifests targets to quote
the variable (i.e., use "$(CONTROLLER_GEN)" in both the generate recipe and the
manifests recipe) so the command is treated as a single word and works with
paths containing spaces.

79-81: verify-codegen needs to be on the default verify path.

Right now this target exists, but make verify still skips it. Generated CRDs/deepcopy files can drift while local and CI verification still pass.

Follow-up change outside this hunk
-.PHONY: verify
-verify: tidy vet fmt lint  ## Verify the codebase (tidy, vet, fmt, lint).
+.PHONY: verify
+verify: tidy vet fmt lint verify-codegen ## Verify the codebase (tidy, vet, fmt, lint, codegen).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@Makefile` around lines 79 - 81, The Makefile defines a verify-codegen target
but the main verify target doesn't depend on it, so CI/local `make verify` skips
codegen checks; update the Makefile so the verify target depends on
verify-codegen (add verify-codegen to the dependency list for the verify target)
so that calling `make verify` will run the `verify-codegen` rule (target names:
verify, verify-codegen).
pkg/controller/externalmodel/reconciler.go (1)

199-218: Replace namespace-wide model scans with a field index on provider refs.

On every ExternalProvider event, mapProviderToModels lists all ExternalModel objects in the namespace and scans all refs. This degrades quickly with tenant/model count and increases reconcile latency.

Proposed refactor
 func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
+  if err := mgr.GetFieldIndexer().IndexField(context.Background(),
+    &inferencev1alpha1.ExternalModel{},
+    "spec.externalProviderRefs.ref.name",
+    func(raw client.Object) []string {
+      m := raw.(*inferencev1alpha1.ExternalModel)
+      out := make([]string, 0, len(m.Spec.ExternalProviderRefs))
+      for _, ref := range m.Spec.ExternalProviderRefs {
+        out = append(out, ref.Ref.Name)
+      }
+      return out
+    },
+  ); err != nil {
+    return err
+  }
+
   return ctrl.NewControllerManagedBy(mgr).
@@
 }
 
 func (r *Reconciler) mapProviderToModels(ctx context.Context, obj client.Object) []reconcile.Request {
   provider := obj.(*inferencev1alpha1.ExternalProvider)
   modelList := &inferencev1alpha1.ExternalModelList{}
-  if err := r.List(ctx, modelList, client.InNamespace(provider.Namespace)); err != nil {
+  if err := r.List(ctx, modelList,
+    client.InNamespace(provider.Namespace),
+    client.MatchingFields{"spec.externalProviderRefs.ref.name": provider.Name},
+  ); err != nil {
     return nil
   }

As per coding guidelines "REVIEW PRIORITIES: ... 4. Performance problems".

Also applies to: 221-228

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/externalmodel/reconciler.go` around lines 199 - 218, The
current mapProviderToModels implementation does an expensive namespace-wide List
and scan; replace it with a field index on ExternalModel external provider names
and query via MatchingFields. Add an index in the controller setup (e.g., in
SetupWithManager or SetupIndexes) using mgr.GetFieldIndexer().IndexField(ctx,
&v1alpha1.ExternalModel{}, "externalProviderRefNames", func(obj client.Object)
[]string { extract and return each ref.Ref.Name from
obj.(*ExternalModel).Spec.ExternalProviderRefs }), then change
mapProviderToModels to List only models matching
client.InNamespace(provider.Namespace) and
client.MatchingFields(map[string]string{"externalProviderRefNames":
provider.Name}) (remove the manual loop scanning refs). Repeat the same change
for the other similar block referenced (lines ~221-228).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.gitignore:
- Line 3: The .gitignore currently has an unanchored rule "controller" which
ignores any path segment named controller; update that line to anchor it to the
repository root by replacing the rule with "/controller" so only the top-level
binary artifact is ignored (locate the existing "controller" entry and change it
to "/controller").

In `@api/inference/v1alpha1/externalmodel_types.go`:
- Around line 42-47: The ExternalProviderRefs slice on the ExternalModel spec
lacks an upper bound, allowing unbounded expansion of HTTPRoute rules; add a
kubebuilder validation annotation to limit items (e.g., add `//
+kubebuilder:validation:MaxItems=10`) above the ExternalProviderRefs declaration
in externalmodel_types.go so the field `ExternalProviderRefs
[]ExternalProviderRef` is constrained; ensure the annotation uses the same
spelling/casing as the existing `+kubebuilder:validation:Required`/`MinItems`
comments so the CRD generator picks up the MaxItems limit.

In `@api/inference/v1alpha1/types_test.go`:
- Around line 63-83: The TestExternalModelDeepCopy currently only checks
string-field mutation; add verification for pointer-field aliasing by
initializing ExternalProviderRef.Weight (a *int32) on original.Spec (in
TestExternalModelDeepCopy), call original.DeepCopy(), mutate the
copied.Spec.ExternalProviderRefs[0].Weight value (e.g., change the int32 via
pointer deref) and assert that original.Spec.ExternalProviderRefs[0].Weight
still has the original value; ensure you allocate a separate int32 for the
original and copied objects to detect shallow-copy regressions in
ExternalModel.DeepCopy.

In `@config/crd/bases/inference.opendatahub.io_externalmodels.yaml`:
- Around line 17-23: The additionalPrinterColumns currently hardcode
.spec.externalProviderRefs[0] which only shows the first provider; update the
two columns under additionalPrinterColumns to aggregate all entries by replacing
.spec.externalProviderRefs[0].targetModel and
.spec.externalProviderRefs[0].ref.name with jsonPaths that iterate the array
(e.g. .spec.externalProviderRefs[*].targetModel and
.spec.externalProviderRefs[*].ref.name) so kubectl prints all providers/targets,
or remove these two columns entirely if you prefer not to expose aggregated
provider info; adjust the jsonPath entries for additionalPrinterColumns
accordingly.

In `@deploy/controller/deployment.yaml`:
- Around line 27-30: The role currently grants cluster-wide "list" and "watch"
on Secrets (the rule with resources: ["secrets"]), which is excessive; update
the rule for Secret validation used by the controller to only allow the minimal
verb(s) needed (e.g., "get") by removing "list" and "watch" from the verbs array
on that secrets rule in deployment.yaml (or scope to specific resourceNames if a
fixed secret must be accessed), leaving the rest of the role unchanged.

In `@Dockerfile.controller`:
- Line 25: The base image in the Dockerfile is using an unstable ':latest' tag
in the FROM statement (FROM --platform=$TARGETPLATFORM
registry.access.redhat.com/ubi9/ubi-minimal:latest); replace this with a pinned,
immutable reference — either a specific, versioned tag (e.g., ubi9:X.Y.Z) or
preferably an image digest (sha256) — so the controller runtime base image is
fixed and repeatable; update the FROM line to use that exact tag or digest and
record the chosen reference in your image release notes.

In `@docs/plans/2026-04-27-split-external-provider-prs.md`:
- Around line 42-47: Update the fenced code block containing the tree diagram
(the block that currently begins with ```) to declare a language identifier by
changing the opening fence to ```text so the diagram is treated as text and
satisfies MD040; leave the block content unchanged except for adding the
language token.

In `@pkg/controller/externalmodel/reconciler.go`:
- Around line 96-99: The reconcileHTTPRoute call currently converts
dependency/spec validation failures into a returned error causing retries;
instead, detect specific validation errors (missing refs, provider not found,
provider not-ready) inside reconciler.go (in reconcileHTTPRoute and the
subsequent checks around lines where reconcileHTTPRoute is invoked and status is
set) and handle them as non-fatal: call r.setStatus(ctx, logger, model,
"Failed", metav1.ConditionFalse, "<SpecificReason>", err.Error()) and return
ctrl.Result{RequeueAfter: <optionalDuration>} with nil error (or ctrl.Result{}
with nil error for non-transient cases) rather than returning err. Update all
similar blocks (including the region flagged around 106-123) so
provider/dependency readiness errors do not bubble as reconcile failures but
only update status and control requeue behavior. Ensure specific reason strings
differentiate missing refs vs provider-not-found vs not-ready for observability.
- Around line 158-169: The setStatus function currently always calls
r.Status().Update which causes needless API writes; change setStatus to first
compute the desired phase and desired metav1.Condition (same fields used with
meta.SetStatusCondition), then compare the existing model.Status.Phase and
model.Status.Conditions to the desired values (use a deep equality check such as
equality.Semantic.DeepEqual or reflect.DeepEqual on the Conditions slice) and
return early if nothing changed; only call meta.SetStatusCondition and
r.Status().Update when the phase or conditions differ (ensure ObservedGeneration
is considered in the comparison so updates that only change generation still
trigger an update).

In `@pkg/controller/externalmodel/resources.go`:
- Around line 95-123: The provider-specific HTTPRouteRule created inside the
providers loop in resources.go only matches on the X-Selected-Provider header;
update the Matches for each gatewayapiv1.HTTPRouteRule (the rule built inside
"for _, p := range providers") to include the same PathPrefix HTTPRouteMatch
used by the initial/default rule so provider rules also constrain by request
path; add a second match entry (or augment the existing HTTPRouteMatch) that
uses gatewayapiv1.HTTPRouteMatch.PathPrefix with the identical path
value/variable used by the initial rule to prevent cross-model routing.

In `@pkg/controller/externalprovider/reconciler.go`:
- Around line 130-141: The validateSecretRef method currently only checks Secret
existence; modify Reconciler.validateSecretRef to fetch the Secret into a
corev1.Secret (instead of passing an empty struct), then verify the Secret.Data
(or StringData) contains a non-empty "api-key" entry for
provider.Spec.Auth.SecretRef.Name in provider.Namespace and return a descriptive
error (e.g., "secret %q missing non-empty api-key") when absent or empty; keep
the existing not-found and Get error handling but add this credential
presence/emptiness validation before returning nil.
- Line 49: The RBAC marker granting Secrets access is too permissive; update the
kubebuilder RBAC comment that currently includes "verbs=get;list;watch" to only
include "verbs=get" for secrets. Locate the RBAC line near the top of the
reconciler (the marker above the reconciler type) and change the verbs to "get"
only—this aligns with the reconciler's validateSecretRef method which only calls
r.Get(ctx, key, &corev1.Secret{}) and avoids adding unnecessary list/watch
permissions.

In `@pkg/plugins/model-provider-resolver/external_model_reconciler.go`:
- Around line 81-85: When providerStore.get(providerKey) returns not found,
delete the cached externalModelInfo for this model from modelStore before
returning RequeueAfter so we fail closed; locate the branch around
providerStore.get(providerKey) / logger.Info(...) and add a call to remove the
model's cached entry from modelStore (e.g., modelStore.delete(...) or
modelStore.remove(...) for the same key used to store externalModelInfo)
immediately before returning ctrl.Result{RequeueAfter: 2 * time.Second}, nil so
ProcessRequest no longer routes using stale provider data.

In `@pkg/plugins/model-provider-resolver/plugin_test.go`:
- Around line 179-187: The test's externalModelInfo instance never sets
totalWeight so selectProvider falls back to refs[0] and never exercises weighted
selection; set externalModelInfo.totalWeight to the sum of the providerRef
weights (e.g., 100 + 0 = 100) when constructing the test data so selectProvider
runs the weighted logic. Update the test's externalModelInfo (the struct used
with selectProvider and providerRef entries) to initialize totalWeight
appropriately to reflect the weights in the refs.

In `@pkg/plugins/model-provider-resolver/plugin.go`:
- Around line 92-109: The loop only inspects refs[0] so models that reference
the provider at index >0 won't be requeued; update the loop over modelList.Items
to iterate all entries in the extracted refs slice (for example: for j := range
refs { ... }) and for each entry perform the map[string]any type assertion, call
nestedString(refMap, "ref", "name") and compare to providerName, and if it
matches append the reconcile.Request (same NamespacedName logic) to requests;
ensure you continue on non-map entries and do not assume refs[0] exists.

In `@pkg/plugins/model-provider-resolver/provider_store.go`:
- Around line 44-60: The store currently saves caller-owned *providerInfo
pointers in addOrUpdate and returns internal pointers from get, allowing callers
to mutate providerInfo (notably the config map) outside s.lock and cause races;
fix by making and storing a defensive deep copy of providerInfo (including a
cloned config map) inside addOrUpdate while holding s.lock, and have get return
a copy (not the internal pointer) created while holding s.lock (implement a
helper like cloneProviderInfo to copy fields and maps so no caller-owned pointer
escapes the lock on read or write).

---

Nitpick comments:
In `@deploy/controller/samples/external-provider-katan.yaml`:
- Around line 1-24: Add a clear header comment at the top of the YAML indicating
these resources are sample/test/local-development only and not real production
endpoints; update the file containing ExternalProvider resources (look for kind:
ExternalProvider and metadata names katan-openai-provider and
katan-vertex-provider) to include a short comment block explaining that endpoint
3-13-21-181.sslip.io is a placeholder resolving to 3.13.21.181 and should not be
used in production or committed without change, and optionally suggest replacing
with real endpoints or secrets for deployment.

In `@Makefile`:
- Around line 71-77: The recipes for the generate and manifests targets use the
unquoted shell variable $(CONTROLLER_GEN), which breaks when the repo path
contains spaces and violates the Makefile rule to quote shell variables; update
the commands in the generate and manifests targets to quote the variable (i.e.,
use "$(CONTROLLER_GEN)" in both the generate recipe and the manifests recipe) so
the command is treated as a single word and works with paths containing spaces.
- Around line 79-81: The Makefile defines a verify-codegen target but the main
verify target doesn't depend on it, so CI/local `make verify` skips codegen
checks; update the Makefile so the verify target depends on verify-codegen (add
verify-codegen to the dependency list for the verify target) so that calling
`make verify` will run the `verify-codegen` rule (target names: verify,
verify-codegen).

In `@pkg/controller/externalmodel/reconciler.go`:
- Around line 199-218: The current mapProviderToModels implementation does an
expensive namespace-wide List and scan; replace it with a field index on
ExternalModel external provider names and query via MatchingFields. Add an index
in the controller setup (e.g., in SetupWithManager or SetupIndexes) using
mgr.GetFieldIndexer().IndexField(ctx, &v1alpha1.ExternalModel{},
"externalProviderRefNames", func(obj client.Object) []string { extract and
return each ref.Ref.Name from obj.(*ExternalModel).Spec.ExternalProviderRefs }),
then change mapProviderToModels to List only models matching
client.InNamespace(provider.Namespace) and
client.MatchingFields(map[string]string{"externalProviderRefNames":
provider.Name}) (remove the manual loop scanning refs). Repeat the same change
for the other similar block referenced (lines ~221-228).

In `@pkg/plugins/model-provider-resolver/provider_store_test.go`:
- Around line 75-93: Update TestProviderStore_ConcurrentAccess to assert
invariants after wg.Wait: call newProviderInfoStore via existing variable
`store` and for each possible key used in the goroutines (e.g., NamespacedName
with Namespace "ns" and Name fmt.Sprintf("prov-%d", i) for i in 0..9) call
`store.get(key)` and add assertions that if the returned pointer is non-nil then
its `provider` field is not empty; use testing helpers (t.Fatalf/t.Errorf) to
fail the test on violation. Ensure you reference the existing methods
`addOrUpdate`, `get`, `delete` and the `providerInfo` struct when adding these
checks so regressions become detectable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 14631011-00f1-4ee0-bdb8-be64e66391bd

📥 Commits

Reviewing files that changed from the base of the PR and between f83425e and 6c1b39e.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum, !**/*.sum
📒 Files selected for processing (42)
  • .gitignore
  • Dockerfile.controller
  • Makefile
  • api/inference/v1alpha1/externalmodel_types.go
  • api/inference/v1alpha1/externalprovider_types.go
  • api/inference/v1alpha1/groupversion_info.go
  • api/inference/v1alpha1/types_test.go
  • api/inference/v1alpha1/zz_generated.deepcopy.go
  • cmd/controller/main.go
  • config/crd/bases/inference.opendatahub.io_externalmodels.yaml
  • config/crd/bases/inference.opendatahub.io_externalproviders.yaml
  • deploy/controller/deployment.yaml
  • deploy/controller/samples/external-model-katan.yaml
  • deploy/controller/samples/external-provider-katan.yaml
  • deploy/payload-processing/templates/rbac.yaml
  • docs/plans/2026-04-27-split-external-provider-prs.md
  • go.mod
  • pkg/controller/common/constants.go
  • pkg/controller/externalmodel/reconciler.go
  • pkg/controller/externalmodel/reconciler_test.go
  • pkg/controller/externalmodel/resources.go
  • pkg/controller/externalmodel/resources_test.go
  • pkg/controller/externalmodel/testdata/gateway-api-crds/httproute-crd.yaml
  • pkg/controller/externalprovider/reconciler.go
  • pkg/controller/externalprovider/reconciler_test.go
  • pkg/controller/externalprovider/resources.go
  • pkg/controller/externalprovider/resources_test.go
  • pkg/controller/externalprovider/testdata/istio-crds/destinationrule-crd.yaml
  • pkg/controller/externalprovider/testdata/istio-crds/serviceentry-crd.yaml
  • pkg/plugins/api-translation/plugin.go
  • pkg/plugins/api-translation/plugin_test.go
  • pkg/plugins/common/state/state-keys.go
  • pkg/plugins/model-provider-resolver/external_model_reconciler.go
  • pkg/plugins/model-provider-resolver/external_model_reconciler_test.go
  • pkg/plugins/model-provider-resolver/external_provider_reconciler.go
  • pkg/plugins/model-provider-resolver/external_provider_reconciler_test.go
  • pkg/plugins/model-provider-resolver/model_store.go
  • pkg/plugins/model-provider-resolver/model_store_test.go
  • pkg/plugins/model-provider-resolver/plugin.go
  • pkg/plugins/model-provider-resolver/plugin_test.go
  • pkg/plugins/model-provider-resolver/provider_store.go
  • pkg/plugins/model-provider-resolver/provider_store_test.go

Comment thread .gitignore
@@ -1,2 +1,3 @@
# Binaries for programs and plugins
bin/*
controller
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Line 3: anchor the ignore rule to repo root.

controller matches any path segment named controller; use /controller to target only the root binary artifact.

Suggested patch
-controller
+/controller
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
controller
/controller
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.gitignore at line 3, The .gitignore currently has an unanchored rule
"controller" which ignores any path segment named controller; update that line
to anchor it to the repository root by replacing the rule with "/controller" so
only the top-level binary artifact is ignored (locate the existing "controller"
entry and change it to "/controller").

Comment on lines +42 to +47
// ExternalProviderRefs maps this model to one or more external providers.
// Each entry specifies the provider, provider-side model name, API format,
// and an optional weight for traffic splitting.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinItems=1
ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"`
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does Kubebuilder/controller-tools support +kubebuilder:validation:MaxItems on slice fields in CRD schemas, and is it enforced by generated CRDs?

💡 Result:

Yes, Kubebuilder/controller-tools supports +kubebuilder:validation:MaxItems on slice fields in CRD schemas, and it is enforced by the generated CRDs. The marker generates the corresponding maxItems property in the OpenAPI v3 validation schema of the CRD, which Kubernetes apiserver enforces during create/update operations.

Citations:


🏁 Script executed:

# Check the ExternalModel type and ExternalProviderRef struct
fd -t f 'externalmodel_types.go' --exec cat -n {} \;

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 5319


🏁 Script executed:

# Check how ExternalProviderRefs is used in the controller
fd -t f 'resources.go' api/inference --exec grep -n 'ExternalProviderRefs' {} +

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 70


🏁 Script executed:

# Search for HTTPRoute generation logic in controller code
fd -t f -e go 'externalmodel' pkg/controller --exec grep -l 'HTTPRoute' {} \;

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 70


🏁 Script executed:

# Check for other unbounded slices in the same types file
fd -t f 'externalmodel_types.go' --exec rg '\[\]' {} -n \;

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 282


🏁 Script executed:

# Find controller directories and files for externalmodel
fd -t f -e go pkg/controller | head -20

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 450


🏁 Script executed:

# Search for where ExternalProviderRef is used in the codebase
rg 'ExternalProviderRef' -t go -l

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 219


🏁 Script executed:

# Look for any controller that processes ExternalModel spec
rg 'ExternalModel' pkg/controller -t go -l | head -5

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 268


🏁 Script executed:

# Examine the resources.go file to see how ExternalProviderRefs is processed
wc -l pkg/controller/externalmodel/resources.go

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 134


🏁 Script executed:

# Get the content of resources.go
cat -n pkg/controller/externalmodel/resources.go

Repository: opendatahub-io/ai-gateway-payload-processing

Length of output: 5591


Add MaxItems to externalProviderRefs to prevent unbounded HTTPRoute expansion.

Each provider ref expands into an HTTPRoute rule (see pkg/controller/externalmodel/resources.go line 96-124); without an upper bound, a single CR can force the controller to generate an arbitrarily large rule set, causing resource exhaustion (CWE-770). Per coding guidelines (**/api/**/*_types.go), add MaxItems validation.

Suggested fix
 	// +kubebuilder:validation:Required
 	// +kubebuilder:validation:MinItems=1
+	// +kubebuilder:validation:MaxItems=16
 	ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"`
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// ExternalProviderRefs maps this model to one or more external providers.
// Each entry specifies the provider, provider-side model name, API format,
// and an optional weight for traffic splitting.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinItems=1
ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"`
// ExternalProviderRefs maps this model to one or more external providers.
// Each entry specifies the provider, provider-side model name, API format,
// and an optional weight for traffic splitting.
// +kubebuilder:validation:Required
// +kubebuilder:validation:MinItems=1
// +kubebuilder:validation:MaxItems=16
ExternalProviderRefs []ExternalProviderRef `json:"externalProviderRefs"`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/inference/v1alpha1/externalmodel_types.go` around lines 42 - 47, The
ExternalProviderRefs slice on the ExternalModel spec lacks an upper bound,
allowing unbounded expansion of HTTPRoute rules; add a kubebuilder validation
annotation to limit items (e.g., add `// +kubebuilder:validation:MaxItems=10`)
above the ExternalProviderRefs declaration in externalmodel_types.go so the
field `ExternalProviderRefs []ExternalProviderRef` is constrained; ensure the
annotation uses the same spelling/casing as the existing
`+kubebuilder:validation:Required`/`MinItems` comments so the CRD generator
picks up the MaxItems limit.

Comment on lines +63 to +83
func TestExternalModelDeepCopy(t *testing.T) {
original := &ExternalModel{
Spec: ExternalModelSpec{
ExternalProviderRefs: []ExternalProviderRef{
{
Ref: NameReference{Name: "my-openai"},
TargetModel: "gpt-4o",
APIFormat: "openai-chat",
},
},
},
}

copied := original.DeepCopy()

assert.Equal(t, original.Spec, copied.Spec)

// Verify deep copy — mutating the copy must not affect the original
copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5"
assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel)
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

DeepCopy test misses pointer-field aliasing on Weight (Line [63]).

ExternalProviderRef.Weight is a *int32, but this test only validates string field mutation. Add a pointer mutation assertion to catch shallow-copy regressions.

Proposed test hardening
 func TestExternalModelDeepCopy(t *testing.T) {
+	w := int32(7)
 	original := &ExternalModel{
 		Spec: ExternalModelSpec{
 			ExternalProviderRefs: []ExternalProviderRef{
 				{
 					Ref:         NameReference{Name: "my-openai"},
 					TargetModel: "gpt-4o",
 					APIFormat:   "openai-chat",
+					Weight:      &w,
 				},
 			},
 		},
 	}
@@
 	copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5"
 	assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel)
+
+	*copied.Spec.ExternalProviderRefs[0].Weight = 99
+	require.NotNil(t, original.Spec.ExternalProviderRefs[0].Weight)
+	assert.Equal(t, int32(7), *original.Spec.ExternalProviderRefs[0].Weight)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func TestExternalModelDeepCopy(t *testing.T) {
original := &ExternalModel{
Spec: ExternalModelSpec{
ExternalProviderRefs: []ExternalProviderRef{
{
Ref: NameReference{Name: "my-openai"},
TargetModel: "gpt-4o",
APIFormat: "openai-chat",
},
},
},
}
copied := original.DeepCopy()
assert.Equal(t, original.Spec, copied.Spec)
// Verify deep copy — mutating the copy must not affect the original
copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5"
assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel)
}
func TestExternalModelDeepCopy(t *testing.T) {
w := int32(7)
original := &ExternalModel{
Spec: ExternalModelSpec{
ExternalProviderRefs: []ExternalProviderRef{
{
Ref: NameReference{Name: "my-openai"},
TargetModel: "gpt-4o",
APIFormat: "openai-chat",
Weight: &w,
},
},
},
}
copied := original.DeepCopy()
assert.Equal(t, original.Spec, copied.Spec)
// Verify deep copy — mutating the copy must not affect the original
copied.Spec.ExternalProviderRefs[0].TargetModel = "gpt-3.5"
assert.Equal(t, "gpt-4o", original.Spec.ExternalProviderRefs[0].TargetModel)
*copied.Spec.ExternalProviderRefs[0].Weight = 99
require.NotNil(t, original.Spec.ExternalProviderRefs[0].Weight)
assert.Equal(t, int32(7), *original.Spec.ExternalProviderRefs[0].Weight)
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@api/inference/v1alpha1/types_test.go` around lines 63 - 83, The
TestExternalModelDeepCopy currently only checks string-field mutation; add
verification for pointer-field aliasing by initializing
ExternalProviderRef.Weight (a *int32) on original.Spec (in
TestExternalModelDeepCopy), call original.DeepCopy(), mutate the
copied.Spec.ExternalProviderRefs[0].Weight value (e.g., change the int32 via
pointer deref) and assert that original.Spec.ExternalProviderRefs[0].Weight
still has the original value; ensure you allocate a separate int32 for the
original and copied objects to detect shallow-copy regressions in
ExternalModel.DeepCopy.

Comment on lines +17 to +23
- additionalPrinterColumns:
- jsonPath: .spec.externalProviderRefs[0].targetModel
name: TargetModel
type: string
- jsonPath: .spec.externalProviderRefs[0].ref.name
name: Provider
type: string
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

These printer columns are now misleading for multi-provider models.

Both columns hardcode externalProviderRefs[0], so kubectl get externalmodels reports only the first provider/ref even when the model is split across several backends. Either remove them or replace them with aggregate columns.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/crd/bases/inference.opendatahub.io_externalmodels.yaml` around lines
17 - 23, The additionalPrinterColumns currently hardcode
.spec.externalProviderRefs[0] which only shows the first provider; update the
two columns under additionalPrinterColumns to aggregate all entries by replacing
.spec.externalProviderRefs[0].targetModel and
.spec.externalProviderRefs[0].ref.name with jsonPaths that iterate the array
(e.g. .spec.externalProviderRefs[*].targetModel and
.spec.externalProviderRefs[*].ref.name) so kubectl prints all providers/targets,
or remove these two columns entirely if you prefer not to expose aggregated
provider info; adjust the jsonPath entries for additionalPrinterColumns
accordingly.

Comment on lines +27 to +30
# Secrets (validate secretRef)
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Major: drop cluster-wide secret listing/watching from this role (CWE-250).

The controller only appears to validate a referenced Secret by name. Granting list and watch lets a compromised controller enumerate Secret objects across all namespaces, which is unnecessary privilege for this code path.

Suggested fix
   - apiGroups: [""]
     resources: ["secrets"]
-    verbs: ["get", "list", "watch"]
+    verbs: ["get"]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Secrets (validate secretRef)
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get", "list", "watch"]
# Secrets (validate secretRef)
- apiGroups: [""]
resources: ["secrets"]
verbs: ["get"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/controller/deployment.yaml` around lines 27 - 30, The role currently
grants cluster-wide "list" and "watch" on Secrets (the rule with resources:
["secrets"]), which is excessive; update the rule for Secret validation used by
the controller to only allow the minimal verb(s) needed (e.g., "get") by
removing "list" and "watch" from the verbs array on that secrets rule in
deployment.yaml (or scope to specific resourceNames if a fixed secret must be
accessed), leaving the rest of the role unchanged.

Comment on lines +130 to +141
func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error {
key := types.NamespacedName{
Name: provider.Spec.Auth.SecretRef.Name,
Namespace: provider.Namespace,
}
if err := r.Get(ctx, key, &corev1.Secret{}); err != nil {
if apierrors.IsNotFound(err) {
return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace)
}
return fmt.Errorf("failed to get secret %q: %w", key.Name, err)
}
return nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fail reconciliation when the Secret lacks a non-empty api-key.

This only checks that the Secret exists. A Secret without data["api-key"] still lets the provider reach Ready, and downstream auth injection will then fail every request at runtime. Validate the required credential key here instead (CWE-20).

Suggested fix
 func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error {
 	key := types.NamespacedName{
 		Name:      provider.Spec.Auth.SecretRef.Name,
 		Namespace: provider.Namespace,
 	}
-	if err := r.Get(ctx, key, &corev1.Secret{}); err != nil {
+	secret := &corev1.Secret{}
+	if err := r.Get(ctx, key, secret); err != nil {
 		if apierrors.IsNotFound(err) {
 			return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace)
 		}
 		return fmt.Errorf("failed to get secret %q: %w", key.Name, err)
 	}
+	apiKey, ok := secret.Data["api-key"]
+	if !ok || len(apiKey) == 0 {
+		return fmt.Errorf("secret %q in namespace %q must contain non-empty data[\"api-key\"]", key.Name, key.Namespace)
+	}
 	return nil
 }

As per coding guidelines, **/*.go: 4. Validate CR spec fields before using in ConfigMaps/Secrets.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error {
key := types.NamespacedName{
Name: provider.Spec.Auth.SecretRef.Name,
Namespace: provider.Namespace,
}
if err := r.Get(ctx, key, &corev1.Secret{}); err != nil {
if apierrors.IsNotFound(err) {
return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace)
}
return fmt.Errorf("failed to get secret %q: %w", key.Name, err)
}
return nil
func (r *Reconciler) validateSecretRef(ctx context.Context, provider *inferencev1alpha1.ExternalProvider) error {
key := types.NamespacedName{
Name: provider.Spec.Auth.SecretRef.Name,
Namespace: provider.Namespace,
}
secret := &corev1.Secret{}
if err := r.Get(ctx, key, secret); err != nil {
if apierrors.IsNotFound(err) {
return fmt.Errorf("secret %q not found in namespace %q", key.Name, key.Namespace)
}
return fmt.Errorf("failed to get secret %q: %w", key.Name, err)
}
apiKey, ok := secret.Data["api-key"]
if !ok || len(apiKey) == 0 {
return fmt.Errorf("secret %q in namespace %q must contain non-empty data[\"api-key\"]", key.Name, key.Namespace)
}
return nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/externalprovider/reconciler.go` around lines 130 - 141, The
validateSecretRef method currently only checks Secret existence; modify
Reconciler.validateSecretRef to fetch the Secret into a corev1.Secret (instead
of passing an empty struct), then verify the Secret.Data (or StringData)
contains a non-empty "api-key" entry for provider.Spec.Auth.SecretRef.Name in
provider.Namespace and return a descriptive error (e.g., "secret %q missing
non-empty api-key") when absent or empty; keep the existing not-found and Get
error handling but add this credential presence/emptiness validation before
returning nil.

Comment on lines +81 to +85
provInfo, found := r.providerStore.get(providerKey)
if !found {
logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName)
return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Delete the cached model entry before requeueing a missing provider.

If this model was already resolved once, Line 84 returns RequeueAfter but leaves the old externalModelInfo in modelStore. ProcessRequest will keep routing with stale provider data until a later successful reconcile. Fail closed here.

Suggested fix
 		providerKey := types.NamespacedName{Namespace: req.Namespace, Name: providerRefName}
 		provInfo, found := r.providerStore.get(providerKey)
 		if !found {
+			r.modelStore.deleteExternalModel(req.NamespacedName)
 			logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName)
 			return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
provInfo, found := r.providerStore.get(providerKey)
if !found {
logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName)
return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
}
providerKey := types.NamespacedName{Namespace: req.Namespace, Name: providerRefName}
provInfo, found := r.providerStore.get(providerKey)
if !found {
r.modelStore.deleteExternalModel(req.NamespacedName)
logger.Info("ExternalProvider not yet available, requeueing", "provider", providerRefName)
return ctrl.Result{RequeueAfter: 2 * time.Second}, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/plugins/model-provider-resolver/external_model_reconciler.go` around
lines 81 - 85, When providerStore.get(providerKey) returns not found, delete the
cached externalModelInfo for this model from modelStore before returning
RequeueAfter so we fail closed; locate the branch around
providerStore.get(providerKey) / logger.Info(...) and add a call to remove the
model's cached entry from modelStore (e.g., modelStore.delete(...) or
modelStore.remove(...) for the same key used to store externalModelInfo)
immediately before returning ctrl.Result{RequeueAfter: 2 * time.Second}, nil so
ProcessRequest no longer routes using stale provider data.

Comment on lines +179 to +187
&externalModelInfo{
modelName: "gpt4",
refs: []providerRef{
{providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o",
secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100},
{providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock",
secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0},
},
},
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Initialize totalWeight, or this test never exercises weighted selection.

pkg/plugins/model-provider-resolver/model_store.go:42-63 only uses weights when externalModelInfo.totalWeight > 0. Here it stays zero, so selectProvider() falls back to refs[0] and the test passes even if weighted routing is broken.

Suggested fix
 		&externalModelInfo{
 			modelName: "gpt4",
+			totalWeight: 100,
 			refs: []providerRef{
 				{providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o",
 					secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100},
 				{providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock",
 					secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0},
 			},
 		},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
&externalModelInfo{
modelName: "gpt4",
refs: []providerRef{
{providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o",
secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100},
{providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock",
secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0},
},
},
&externalModelInfo{
modelName: "gpt4",
totalWeight: 100,
refs: []providerRef{
{providerName: "openai-provider", provider: "openai", targetModel: "gpt-4o",
secretName: "k1", secretNamespace: "llm", apiFormat: "openai", weight: 100},
{providerName: "bedrock-provider", provider: "bedrock-openai", targetModel: "gpt-4o-bedrock",
secretName: "k2", secretNamespace: "llm", apiFormat: "bedrock-openai", weight: 0},
},
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/plugins/model-provider-resolver/plugin_test.go` around lines 179 - 187,
The test's externalModelInfo instance never sets totalWeight so selectProvider
falls back to refs[0] and never exercises weighted selection; set
externalModelInfo.totalWeight to the sum of the providerRef weights (e.g., 100 +
0 = 100) when constructing the test data so selectProvider runs the weighted
logic. Update the test's externalModelInfo (the struct used with selectProvider
and providerRef entries) to initialize totalWeight appropriately to reflect the
weights in the refs.

Comment on lines +92 to +109
for i := range modelList.Items {
refs, _, _ := unstructured.NestedSlice(modelList.Items[i].Object, "spec", "externalProviderRefs")
if len(refs) == 0 {
continue
}
refMap, ok := refs[0].(map[string]any)
if !ok {
continue
}
if nestedString(refMap, "ref", "name") == providerName {
requests = append(requests, reconcile.Request{
NamespacedName: types.NamespacedName{
Name: modelList.Items[i].GetName(),
Namespace: modelList.Items[i].GetNamespace(),
},
})
}
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Scan all externalProviderRefs when mapping provider updates.

Line 97 only checks refs[0]. Any model that references this provider in position 2+ will never be re-enqueued on provider changes, so its cached secret/config stays stale.

Suggested fix
 		var requests []reconcile.Request
 		for i := range modelList.Items {
 			refs, _, _ := unstructured.NestedSlice(modelList.Items[i].Object, "spec", "externalProviderRefs")
 			if len(refs) == 0 {
 				continue
 			}
-			refMap, ok := refs[0].(map[string]any)
-			if !ok {
-				continue
-			}
-			if nestedString(refMap, "ref", "name") == providerName {
-				requests = append(requests, reconcile.Request{
-					NamespacedName: types.NamespacedName{
-						Name:      modelList.Items[i].GetName(),
-						Namespace: modelList.Items[i].GetNamespace(),
-					},
-				})
+			for _, rawRef := range refs {
+				refMap, ok := rawRef.(map[string]any)
+				if !ok {
+					continue
+				}
+				if nestedString(refMap, "ref", "name") != providerName {
+					continue
+				}
+				requests = append(requests, reconcile.Request{
+					NamespacedName: types.NamespacedName{
+						Name:      modelList.Items[i].GetName(),
+						Namespace: modelList.Items[i].GetNamespace(),
+					},
+				})
+				break
 			}
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/plugins/model-provider-resolver/plugin.go` around lines 92 - 109, The
loop only inspects refs[0] so models that reference the provider at index >0
won't be requeued; update the loop over modelList.Items to iterate all entries
in the extracted refs slice (for example: for j := range refs { ... }) and for
each entry perform the map[string]any type assertion, call nestedString(refMap,
"ref", "name") and compare to providerName, and if it matches append the
reconcile.Request (same NamespacedName logic) to requests; ensure you continue
on non-map entries and do not assume refs[0] exists.

Comment on lines +44 to +60
func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) {
s.lock.Lock()
defer s.lock.Unlock()
s.providers[key.String()] = info
}

func (s *providerInfoStore) delete(key types.NamespacedName) {
s.lock.Lock()
defer s.lock.Unlock()
delete(s.providers, key.String())
}

func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) {
s.lock.RLock()
defer s.lock.RUnlock()
info, ok := s.providers[key.String()]
return info, ok
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Lines 44-60: mutable pointers escape the lock boundary (CWE-362).

addOrUpdate stores caller-owned pointers and get returns internal pointers. Any downstream mutation (especially config map writes) can race outside RWMutex protection and corrupt store state.

Suggested patch
 import (
+	"maps"
 	"sync"

 	"k8s.io/apimachinery/pkg/types"
 )
@@
 func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) {
 	s.lock.Lock()
 	defer s.lock.Unlock()
-	s.providers[key.String()] = info
+	s.providers[key.String()] = cloneProviderInfo(info)
 }
@@
 func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) {
 	s.lock.RLock()
 	defer s.lock.RUnlock()
 	info, ok := s.providers[key.String()]
-	return info, ok
+	if !ok {
+		return nil, false
+	}
+	return cloneProviderInfo(info), true
 }
+
+func cloneProviderInfo(in *providerInfo) *providerInfo {
+	if in == nil {
+		return nil
+	}
+	out := *in
+	out.config = maps.Clone(in.config)
+	return &out
+}

As per coding guidelines, **: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) {
s.lock.Lock()
defer s.lock.Unlock()
s.providers[key.String()] = info
}
func (s *providerInfoStore) delete(key types.NamespacedName) {
s.lock.Lock()
defer s.lock.Unlock()
delete(s.providers, key.String())
}
func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) {
s.lock.RLock()
defer s.lock.RUnlock()
info, ok := s.providers[key.String()]
return info, ok
import (
"maps"
"sync"
"k8s.io/apimachinery/pkg/types"
)
func (s *providerInfoStore) addOrUpdate(key types.NamespacedName, info *providerInfo) {
s.lock.Lock()
defer s.lock.Unlock()
s.providers[key.String()] = cloneProviderInfo(info)
}
func (s *providerInfoStore) delete(key types.NamespacedName) {
s.lock.Lock()
defer s.lock.Unlock()
delete(s.providers, key.String())
}
func (s *providerInfoStore) get(key types.NamespacedName) (*providerInfo, bool) {
s.lock.RLock()
defer s.lock.RUnlock()
info, ok := s.providers[key.String()]
if !ok {
return nil, false
}
return cloneProviderInfo(info), true
}
func cloneProviderInfo(in *providerInfo) *providerInfo {
if in == nil {
return nil
}
out := *in
out.config = maps.Clone(in.config)
return &out
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/plugins/model-provider-resolver/provider_store.go` around lines 44 - 60,
The store currently saves caller-owned *providerInfo pointers in addOrUpdate and
returns internal pointers from get, allowing callers to mutate providerInfo
(notably the config map) outside s.lock and cause races; fix by making and
storing a defensive deep copy of providerInfo (including a cloned config map)
inside addOrUpdate while holding s.lock, and have get return a copy (not the
internal pointer) created while holding s.lock (implement a helper like
cloneProviderInfo to copy fields and maps so no caller-owned pointer escapes the
lock on read or write).

@yossiovadia
Copy link
Copy Markdown
Contributor Author

E2E Validation — Multi-Provider Traffic Splitting

Tested on maas-experimental Kind cluster with llm-katan simulator. Created an ExternalModel with 80/20 weighted providers (OpenAI/Anthropic), both pointing to the same llm-katan backend.

Setup

apiVersion: inference.opendatahub.io/v1alpha1
kind: ExternalModel
metadata:
  name: multi-provider-test
  namespace: llm
spec:
  externalProviderRefs:
    - ref: {name: katan-openai-provider}
      targetModel: llm-katan-echo
      apiFormat: openai
      weight: 80
    - ref: {name: katan-anthropic-provider}
      targetModel: llm-katan-echo
      apiFormat: anthropic
      weight: 20

HTTPRoute created by controller

3 rules automatically generated:

Rule 1: PathPrefix /llm/multi-provider-test  → default backend (openai)
Rule 2: Header X-Selected-Provider: katan-openai-provider → openai backend + Host: 3-13-21-181.sslip.io
Rule 3: Header X-Selected-Provider: katan-anthropic-provider → anthropic backend + Host: 3-13-21-181.sslip.io

Traffic distribution (20 requests)

OpenAI:    14 (70%)
Anthropic:  6 (30%)

Close to configured 80/20 — variance is expected with small sample size. Provider selection is identified by response ID format: chatcmpl-* = OpenAI translator, msg_* = Anthropic translator.

Verified behaviors

Test Result
Single-provider models still work HTTP 200
Multi-provider model returns 200 HTTP 200
Traffic splits across both providers 14/6 (OpenAI/Anthropic)
Each provider gets correct translation OpenAI path, Anthropic path confirmed by response format
Model field rewritten to targetModel Response shows llm-katan-echo, not multi-provider-test
X-Selected-Provider header set by BBR Envoy routes deterministically by header match

Larger sample (50 requests, from E2E test PR #214)

OpenAI:    40 (80%)
Anthropic: 10 (20%)

@yossiovadia
Copy link
Copy Markdown
Contributor Author

Note on file count: This PR shows 43 changed files because it's stacked on top of unmerged PRs (#182#183#184#212). GitHub shows the full diff from main, not just this PR's delta.

The actual changes in this PR are ~14 files (CRD Weight field, multi-provider HTTPRoute builder, weighted provider selection in plugin, tests). Once the dependency chain merges, GitHub will automatically recalculate the diff to show only this PR's changes.

@nirrozenbaum
Copy link
Copy Markdown
Contributor

@yossiovadia going to close this PR.
can you please create a new PR dedicated only for weight field on the CRD, that if missing defaults to weight 1?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants