Skip to content

Remove enforceServers and image validation from MCPRegistry#4720

Closed
ChrisJBurns wants to merge 1 commit intomainfrom
chrisburns/fix-image-validation-configyaml
Closed

Remove enforceServers and image validation from MCPRegistry#4720
ChrisJBurns wants to merge 1 commit intomainfrom
chrisburns/fix-image-validation-configyaml

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns commented Apr 9, 2026

Summary

The enforceServers feature has been broken since PR #2568 removed data sources from the operator. The {name}-registry-storage ConfigMap that image validation reads from is never created — nothing populates it. As a result, enforceServers: true silently does nothing: the ConfigMap lookup returns NotFound and the validator returns false for every check. Rather than fixing a feature with no working implementation, this PR removes it entirely.

Fixes #4717

Medium level
  • Removed EnforceServers field from MCPRegistrySpec and the GetStorageName() helper that only existed to support it
  • Deleted the entire image validation subsystem (image_validation.go, its tests, and the ImageValidator interface)
  • Stripped image validation wiring from both MCPServer and EmbeddingServer controllers — the ImageValidation field, the validateImage method, the condition-setting code, and the setImageValidationCondition helper
  • Removed the ConditionImageValidated status condition and all related reason constants
  • Cleaned up main.go (removed imageValidation variable and enableRegistry parameter pass-through)
  • Deleted the enforcing example and regenerated CRD manifests
Low level
File Change
cmd/thv-operator/api/v1alpha1/mcpregistry_types.go Remove EnforceServers field and GetStorageName() method
cmd/thv-operator/api/v1alpha1/mcpserver_types.go Remove ConditionImageValidated and all ConditionReasonImageValidation* constants
cmd/thv-operator/api/v1alpha1/embeddingserver_types.go Update comment referencing removed condition
cmd/thv-operator/controllers/mcpserver_controller.go Remove image validation block, setImageValidationCondition helper, unused goerr import
cmd/thv-operator/controllers/embeddingserver_controller.go Remove ImageValidation field, validateImage method, validation call site
cmd/thv-operator/controllers/embeddingserver_controller_test.go Remove TestValidateImage and ImageValidation field from reconciler literals
cmd/thv-operator/controllers/mcpserver_test_helpers_test.go Remove ImageValidation field from test reconciler
cmd/thv-operator/main.go Remove imageValidation variable, enableRegistry param from setupServerControllers, validation import
cmd/thv-operator/pkg/validation/image_validation.go Deleted
cmd/thv-operator/pkg/validation/image_validation_test.go Deleted
cmd/thv-operator/pkg/validation/cedar_validation.go Add package comment (was on deleted file)
cmd/thv-operator/pkg/registryapi/manager.go Remove dead getConfigMapName function
cmd/thv-operator/pkg/registryapi/types_test.go Remove TestGetConfigMapName and GetStorageName assertions from TestMCPRegistryHelperMethods
cmd/thv-operator/test-integration/embedding-server/suite_test.go Remove ImageValidation field and validation import
deploy/charts/operator-crds/*/toolhive.stacklok.dev_mcpregistries.yaml Regenerated — enforceServers field removed
examples/operator/mcp-registries/mcpregistry-enforcing.yaml Deleted

Type of change

  • Refactoring (no behavior change)

Test plan

  • Linting (task lint-fix) — 0 issues
  • Unit tests (go test ./cmd/thv-operator/...) — all pass
  • Build (go build ./cmd/thv-operator/...) — passes

Does this introduce a user-facing change?

The enforceServers field is removed from the MCPRegistry CRD. This is a no-op in practice — the feature has been silently broken since the data sources were moved into the registry server (#2568), so no existing deployment depends on it working.

Special notes for reviewers

  • docs/operator/crd-api.md still references enforceServerstask crdref-gen could not run from the worktree due to a path resolution issue. It should be regenerated from the main repo before merge.
  • cmd/thv-operator/REGISTRY.md and docs/arch/06-registry-system.md contain prose references to enforceServers that should be cleaned up in a follow-up documentation PR.

Generated with Claude Code

The image validation logic used GetStorageName() to look up a
{name}-registry-storage ConfigMap that was only created by the legacy
typed config path. With the configYAML path, this ConfigMap does not
exist, causing enforceServers to silently degrade — the lookup returns
NotFound and checkImageInRegistry returns false.

Replace the ConfigMap-based image lookup with HTTP queries to the
registry API service at Status.URL (/v0.1/servers endpoint). This
checks OCI package identifiers in the server list, which works
correctly with both legacy and configYAML paths.

Fixes #4717

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/L Large PR: 600-999 lines changed label Apr 9, 2026
@ChrisJBurns ChrisJBurns changed the title Query registry API instead of ConfigMap for image validation NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation Apr 9, 2026
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 9, 2026
Comment on lines +287 to +295
return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode)
}

var listResp v0.ServerListResponse
if err := json.NewDecoder(resp.Body).Decode(&listResp); err != nil {
return nil, "", fmt.Errorf("failed to decode registry API response: %w", err)
}

// Check servers in groups
// TODO: check with Rado or Ria, is this needed?
for _, group := range reg.Groups {
for _, server := range group.Servers {
if server.Image == image {
return listResp.Servers, listResp.Metadata.NextCursor, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In fetchRegistryPage(), when resp.StatusCode != http.StatusOK, the function returns early without draining the response body. The deferred resp.Body.Close() closes it unread, preventing Go's HTTP transport from returning the TCP connection to the pool. Add io.Copy(io.Discard, resp.Body) before the early return to allow connection reuse on error paths.

Extended reasoning...

What the bug is and how it manifests

In fetchRegistryPage() (image_validation.go lines 287-295), when resp.StatusCode != http.StatusOK, the code returns an error immediately. A defer closes the body, but the body was never read. Per Go's net/http documentation, the HTTP/1.x transport can only reuse a keep-alive TCP connection if the response body is fully consumed before being closed. Without draining, the connection is closed rather than returned to the pool.

The specific code path that triggers it

The sequence in fetchRegistryPage is: (1) v.httpClient.Do(req) makes the HTTP call, (2) defer resp.Body.Close() is registered, (3) if resp.StatusCode != http.StatusOK returns an error immediately, (4) the deferred Close() runs on an unread body. On the success path the body is fully consumed by json.NewDecoder(resp.Body).Decode, so only the error path is affected.

Why existing code does not prevent it

The defer resp.Body.Close() pattern is correct for the success path but insufficient for error paths. Go's transport checks whether the body was fully read when Close() is called to decide if the connection can be returned to the pool. An unread body means the transport cannot safely reuse the connection and must close the TCP socket instead.

What the impact would be

Each non-200 response from the registry API wastes one TCP connection. Under normal operation this path is never hit. It only matters during registry errors (5xx, 4xx). For a cluster-internal operator admission webhook this is low-impact: non-200 responses should be rare, the connection pool is bounded, and the 10-second client timeout provides a natural backstop. All four verifiers agreed this is a nit-severity issue.

How to fix it

Add io.Copy(io.Discard, resp.Body) before the early return on non-200 status:

if resp.StatusCode \!= http.StatusOK {
    _, _ = io.Copy(io.Discard, resp.Body)
    return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode)
}

Step-by-step proof

  1. fetchRegistryPage calls v.httpClient.Do(req) and receives a 503 response from a struggling registry.
  2. defer resp.Body.Close() is registered but not yet run.
  3. resp.StatusCode != http.StatusOK is true; the function returns an error immediately.
  4. The deferred Close() runs on an unread body (the 503 HTML error page was never consumed).
  5. Go's http.Transport sees an unread body on Close() and marks the connection as non-reusable, closing the underlying TCP socket.
  6. The next pagination call or the next admission webhook invocation must open a fresh TCP connection to the registry service rather than reusing the pooled one.

Comment on lines +121 to +131
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
resp := v0.ServerListResponse{
Servers: servers,
Metadata: v0.Metadata{Count: len(servers)},
}
w.Header().Set("Content-Type", "application/json")
err := json.NewEncoder(w).Encode(resp)
require.NoError(t, err)
}))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In newRegistryAPIServer() (image_validation_test.go:128), require.NoError(t, err) is called inside the HTTP handler which runs in a goroutine managed by httptest.Server, not the test goroutine. Per Go testing docs, t.FailNow() (called by require on failure) must only be invoked from the test goroutine; calling it elsewhere invokes runtime.Goexit() on the handler goroutine, silently terminating it without failing the test. Replace with assert.NoError(t, err), which calls t.Fail() and is safe from any goroutine.

Extended reasoning...

What the bug is and how it manifests

The newRegistryAPIServer() helper function creates an httptest.Server with an HTTP handler that calls require.NoError(t, err) after encoding the JSON response. The require package's NoError function calls t.FailNow() if the error is non-nil. Go's testing documentation explicitly states: "FailNow must be called from the goroutine running the Test or Benchmark function, not from other goroutines created during the test."

The specific code path that triggers it

The HTTP handler runs in a separate goroutine managed by httptest.Server's internal HTTP server. This is a different goroutine from the one running the test function. At line 128:

err := json.NewEncoder(w).Encode(resp)
require.NoError(t, err)  // violates Go testing contract

If Encode fails, require.NoError calls t.FailNow()runtime.Goexit() on the handler goroutine.

Why existing code doesn't prevent it

The compiler and runtime do not enforce that t.FailNow() is only called from the test goroutine. The handler compiles and runs fine. The goroutine constraint is a semantic requirement documented in the testing package but not mechanically enforced.

What the impact would be

If json.NewEncoder(w).Encode(resp) were to fail, runtime.Goexit() would terminate only the HTTP handler goroutine, not the test goroutine. The test continues running without being marked as failed. The client in the test goroutine would receive a partial or truncated response, potentially causing a JSON decode error or other confusing failure rather than a clear test failure at the encoding step. The root cause (encoding failure) would be invisible in test output.

How to fix it

Replace require.NoError(t, err) with assert.NoError(t, err). The assert package calls t.Fail() (which marks the test failed without stopping the goroutine) rather than t.FailNow(), and t.Fail() is safe to call from any goroutine per the Go testing contract.

Step-by-step proof

  1. Test calls newRegistryAPIServer(t, servers) which starts an httptest.Server
  2. Later, the test's HTTP client issues GET /v0.1/servers
  3. httptest.Server dispatches the request in a new goroutine (goroutine B), while the test runs in goroutine A
  4. Inside goroutine B, json.NewEncoder(w).Encode(resp) fails (hypothetically)
  5. require.NoError(t, err)t.FailNow()runtime.Goexit() on goroutine B
  6. Goroutine B exits without writing a complete response; the test's t.Failed() remains false
  7. Goroutine A receives a broken HTTP response and may fail for the wrong reason, or pass if the test doesn't validate the response carefully
  8. The actual encoding error is never surfaced as a direct test failure

Comment on lines +233 to +250
cursor := ""

for {
servers, nextCursor, err := v.fetchRegistryPage(ctx, registryURL, cursor)
if err != nil {
return nil, err
}

allServers = append(allServers, servers...)

if nextCursor == "" {
break
}
cursor = nextCursor

// Safety limit to prevent infinite loops
if len(allServers) > 10000 {
return nil, fmt.Errorf("exceeded maximum server limit (10000)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The safety limit check in listRegistryServers() is unreachable for single-page responses: if len(allServers) > 10000 is placed after the if nextCursor == "" { break } statement, so a registry returning >10,000 items in a single page (ignoring the limit=100 request parameter) would bypass the cap entirely. The fix is to move the safety check before the break.

Extended reasoning...

What the bug is and how it manifests

In listRegistryServers() (image_validation.go, lines 233–250), after appending a page of results the code checks whether there is a next cursor and breaks immediately if there is none. The 10,000-server safety check lives below that break, so it is never evaluated for the final (or only) page of a response:

allServers = append(allServers, servers...)

if nextCursor == "" {
    break                   // ← exits the loop here for single-page responses
}
cursor = nextCursor

// Safety limit to prevent infinite loops
if len(allServers) > 10000 { // ← never reached when nextCursor is empty
    return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}

The specific code path that triggers it

A registry API that ignores the limit=100 query parameter and returns all servers (say, 50,000 of them) in a single JSON response with an empty nextCursor will cause the loop to: (1) append all 50,000 entries to allServers, then (2) see nextCursor == "" and break — without ever checking the length.

Why existing code doesn't prevent it

The limit=100 request parameter is advisory; a non-compliant or compromised registry is free to ignore it. The 10-second HTTP client timeout does bound total transfer time, but json.NewDecoder(resp.Body).Decode() attempts to consume the entire body before returning — meaning a large single-page payload will be fully allocated in memory before any timeout can interrupt decoding at the application layer.

What the impact would be

A misconfigured or malicious registry served at MCPRegistry.Status.URL could force the operator to allocate unbounded memory in a single validation call, potentially causing OOM conditions in the operator pod. The comment says 'Safety limit to prevent infinite loops' but the error message says 'exceeded maximum server limit (10000)' — indicating the intent was a hard total-count cap, not just an infinite-loop guard.

How to fix it

Move the safety check before the break:

allServers = append(allServers, servers...)

if len(allServers) > 10000 {
    return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}

if nextCursor == "" {
    break
}
cursor = nextCursor

Step-by-step proof

  1. Registry API is called with limit=100 but returns 20,000 items and nextCursor: "".
  2. fetchRegistryPage decodes the full body and returns 20,000 ServerResponse objects plus an empty cursor.
  3. Back in listRegistryServers, allServers = append(allServers, servers...) grows to 20,000 entries — all in memory.
  4. if nextCursor == "" { break } is true, so the loop exits.
  5. The if len(allServers) > 10000 check is never evaluated.
  6. listRegistryServers returns all 20,000 entries, and they are passed to findImageInServers for scanning.

Mitigating factors that keep severity low: the registry URL is set by a cluster operator (trusted source), the limit=100 parameter discourages large responses from well-behaved servers, and the 10-second timeout bounds wall-clock exposure. Nevertheless, the placement is a logical error relative to the stated invariant.

@ChrisJBurns ChrisJBurns changed the title NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation Remove enforceServers and image validation from MCPRegistry Apr 13, 2026
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Remove enforceServers field and image validation from MCPRegistry

1 participant