Skip to content

NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation#4720

Open
ChrisJBurns wants to merge 1 commit intomainfrom
chrisburns/fix-image-validation-configyaml
Open

NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation#4720
ChrisJBurns wants to merge 1 commit intomainfrom
chrisburns/fix-image-validation-configyaml

Conversation

@ChrisJBurns
Copy link
Copy Markdown
Collaborator

@ChrisJBurns ChrisJBurns commented Apr 9, 2026

Summary --- NOT TO BE MERGED!

The image validation logic (enforceServers) silently degrades with the new configYAML path because checkImageInRegistry reads from a {name}-registry-storage ConfigMap that only exists in the legacy typed config path. With the legacy path being removed (#4705), enforceServers: true has no effect — the ConfigMap lookup returns NotFound and the function returns false.

This PR replaces the ConfigMap-based image lookup with HTTP queries to the registry API service at Status.URL (/v0.1/servers endpoint), checking OCI package identifiers for image presence. This works correctly with both legacy and configYAML paths.

Fixes #4717

Medium level
  • Replaced ConfigMap read in checkImageInRegistry with HTTP GET to the registry API's /v0.1/servers endpoint using the URL from MCPRegistry.Status.URL
  • Added pagination support (listRegistryServers/fetchRegistryPage) with a 10,000 server safety limit
  • Changed image matching from ImageMetadata.Image (toolhive-core registry types) to checking OCI packages (registryType: "oci", identifier field) in the MCP Registry API response
  • Added httpClient field to RegistryEnforcingValidator with a 10-second timeout
  • Rewrote all tests from ConfigMap fixtures to httptest.NewServer() serving registry API responses
Low level
File Change
cmd/thv-operator/pkg/validation/image_validation.go Replace ConfigMap imports with net/http, net/url, and v0 API types. Add httpClient field. Rewrite checkImageInRegistry to query registry API. Add listRegistryServers, fetchRegistryPage, and findImageInServers. Remove findImageInRegistry and ConfigMap-related code.
cmd/thv-operator/pkg/validation/image_validation_test.go Replace all ConfigMap-based test fixtures with httptest.NewServer. Add test helpers (newRegistryAPIServer, makeOCIServerResponse). Update all test cases to use Status.URL and registry API responses. Add test cases for empty URL, API errors, non-OCI packages, and multiple packages.

Type of change

  • Bug fix

Test plan

  • Unit tests (task test) — all 27 validation tests pass
  • Linting (task lint-fix) — 0 issues
  • Broader operator tests (go test ./cmd/thv-operator/...) — all unit tests pass; only pre-existing integration test failures (missing kubebuilder binaries)

Does this introduce a user-facing change?

Yes — enforceServers: true on MCPRegistry resources using the configYAML path will now correctly enforce image validation. Previously it silently did nothing.

Special notes for reviewers

  • The GetStorageName() method and getConfigMapName() function in registryapi/manager.go are now unused by the validation package but retained since they may still be used by other code paths. They can be cleaned up separately if the legacy path is fully removed.
  • The HTTP client uses a 10-second timeout. For cluster-internal services this should be generous, but could be made configurable if needed.
  • The image matching now checks registryType == "oci" on server packages rather than the legacy ImageMetadata.Image field. This aligns with the MCP Registry API v0.1 specification.

Generated with Claude Code

The image validation logic used GetStorageName() to look up a
{name}-registry-storage ConfigMap that was only created by the legacy
typed config path. With the configYAML path, this ConfigMap does not
exist, causing enforceServers to silently degrade — the lookup returns
NotFound and checkImageInRegistry returns false.

Replace the ConfigMap-based image lookup with HTTP queries to the
registry API service at Status.URL (/v0.1/servers endpoint). This
checks OCI package identifiers in the server list, which works
correctly with both legacy and configYAML paths.

Fixes #4717

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added the size/L Large PR: 600-999 lines changed label Apr 9, 2026
@ChrisJBurns ChrisJBurns changed the title Query registry API instead of ConfigMap for image validation NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation Apr 9, 2026
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Apr 9, 2026
Comment on lines +287 to +295
return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode)
}

var listResp v0.ServerListResponse
if err := json.NewDecoder(resp.Body).Decode(&listResp); err != nil {
return nil, "", fmt.Errorf("failed to decode registry API response: %w", err)
}

// Check servers in groups
// TODO: check with Rado or Ria, is this needed?
for _, group := range reg.Groups {
for _, server := range group.Servers {
if server.Image == image {
return listResp.Servers, listResp.Metadata.NextCursor, nil
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In fetchRegistryPage(), when resp.StatusCode != http.StatusOK, the function returns early without draining the response body. The deferred resp.Body.Close() closes it unread, preventing Go's HTTP transport from returning the TCP connection to the pool. Add io.Copy(io.Discard, resp.Body) before the early return to allow connection reuse on error paths.

Extended reasoning...

What the bug is and how it manifests

In fetchRegistryPage() (image_validation.go lines 287-295), when resp.StatusCode != http.StatusOK, the code returns an error immediately. A defer closes the body, but the body was never read. Per Go's net/http documentation, the HTTP/1.x transport can only reuse a keep-alive TCP connection if the response body is fully consumed before being closed. Without draining, the connection is closed rather than returned to the pool.

The specific code path that triggers it

The sequence in fetchRegistryPage is: (1) v.httpClient.Do(req) makes the HTTP call, (2) defer resp.Body.Close() is registered, (3) if resp.StatusCode != http.StatusOK returns an error immediately, (4) the deferred Close() runs on an unread body. On the success path the body is fully consumed by json.NewDecoder(resp.Body).Decode, so only the error path is affected.

Why existing code does not prevent it

The defer resp.Body.Close() pattern is correct for the success path but insufficient for error paths. Go's transport checks whether the body was fully read when Close() is called to decide if the connection can be returned to the pool. An unread body means the transport cannot safely reuse the connection and must close the TCP socket instead.

What the impact would be

Each non-200 response from the registry API wastes one TCP connection. Under normal operation this path is never hit. It only matters during registry errors (5xx, 4xx). For a cluster-internal operator admission webhook this is low-impact: non-200 responses should be rare, the connection pool is bounded, and the 10-second client timeout provides a natural backstop. All four verifiers agreed this is a nit-severity issue.

How to fix it

Add io.Copy(io.Discard, resp.Body) before the early return on non-200 status:

if resp.StatusCode \!= http.StatusOK {
    _, _ = io.Copy(io.Discard, resp.Body)
    return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode)
}

Step-by-step proof

  1. fetchRegistryPage calls v.httpClient.Do(req) and receives a 503 response from a struggling registry.
  2. defer resp.Body.Close() is registered but not yet run.
  3. resp.StatusCode != http.StatusOK is true; the function returns an error immediately.
  4. The deferred Close() runs on an unread body (the 503 HTML error page was never consumed).
  5. Go's http.Transport sees an unread body on Close() and marks the connection as non-reusable, closing the underlying TCP socket.
  6. The next pagination call or the next admission webhook invocation must open a fresh TCP connection to the registry service rather than reusing the pooled one.

Comment on lines +121 to +131
return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
resp := v0.ServerListResponse{
Servers: servers,
Metadata: v0.Metadata{Count: len(servers)},
}
w.Header().Set("Content-Type", "application/json")
err := json.NewEncoder(w).Encode(resp)
require.NoError(t, err)
}))
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 In newRegistryAPIServer() (image_validation_test.go:128), require.NoError(t, err) is called inside the HTTP handler which runs in a goroutine managed by httptest.Server, not the test goroutine. Per Go testing docs, t.FailNow() (called by require on failure) must only be invoked from the test goroutine; calling it elsewhere invokes runtime.Goexit() on the handler goroutine, silently terminating it without failing the test. Replace with assert.NoError(t, err), which calls t.Fail() and is safe from any goroutine.

Extended reasoning...

What the bug is and how it manifests

The newRegistryAPIServer() helper function creates an httptest.Server with an HTTP handler that calls require.NoError(t, err) after encoding the JSON response. The require package's NoError function calls t.FailNow() if the error is non-nil. Go's testing documentation explicitly states: "FailNow must be called from the goroutine running the Test or Benchmark function, not from other goroutines created during the test."

The specific code path that triggers it

The HTTP handler runs in a separate goroutine managed by httptest.Server's internal HTTP server. This is a different goroutine from the one running the test function. At line 128:

err := json.NewEncoder(w).Encode(resp)
require.NoError(t, err)  // violates Go testing contract

If Encode fails, require.NoError calls t.FailNow()runtime.Goexit() on the handler goroutine.

Why existing code doesn't prevent it

The compiler and runtime do not enforce that t.FailNow() is only called from the test goroutine. The handler compiles and runs fine. The goroutine constraint is a semantic requirement documented in the testing package but not mechanically enforced.

What the impact would be

If json.NewEncoder(w).Encode(resp) were to fail, runtime.Goexit() would terminate only the HTTP handler goroutine, not the test goroutine. The test continues running without being marked as failed. The client in the test goroutine would receive a partial or truncated response, potentially causing a JSON decode error or other confusing failure rather than a clear test failure at the encoding step. The root cause (encoding failure) would be invisible in test output.

How to fix it

Replace require.NoError(t, err) with assert.NoError(t, err). The assert package calls t.Fail() (which marks the test failed without stopping the goroutine) rather than t.FailNow(), and t.Fail() is safe to call from any goroutine per the Go testing contract.

Step-by-step proof

  1. Test calls newRegistryAPIServer(t, servers) which starts an httptest.Server
  2. Later, the test's HTTP client issues GET /v0.1/servers
  3. httptest.Server dispatches the request in a new goroutine (goroutine B), while the test runs in goroutine A
  4. Inside goroutine B, json.NewEncoder(w).Encode(resp) fails (hypothetically)
  5. require.NoError(t, err)t.FailNow()runtime.Goexit() on goroutine B
  6. Goroutine B exits without writing a complete response; the test's t.Failed() remains false
  7. Goroutine A receives a broken HTTP response and may fail for the wrong reason, or pass if the test doesn't validate the response carefully
  8. The actual encoding error is never surfaced as a direct test failure

Comment on lines +233 to +250
cursor := ""

for {
servers, nextCursor, err := v.fetchRegistryPage(ctx, registryURL, cursor)
if err != nil {
return nil, err
}

allServers = append(allServers, servers...)

if nextCursor == "" {
break
}
cursor = nextCursor

// Safety limit to prevent infinite loops
if len(allServers) > 10000 {
return nil, fmt.Errorf("exceeded maximum server limit (10000)")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The safety limit check in listRegistryServers() is unreachable for single-page responses: if len(allServers) > 10000 is placed after the if nextCursor == "" { break } statement, so a registry returning >10,000 items in a single page (ignoring the limit=100 request parameter) would bypass the cap entirely. The fix is to move the safety check before the break.

Extended reasoning...

What the bug is and how it manifests

In listRegistryServers() (image_validation.go, lines 233–250), after appending a page of results the code checks whether there is a next cursor and breaks immediately if there is none. The 10,000-server safety check lives below that break, so it is never evaluated for the final (or only) page of a response:

allServers = append(allServers, servers...)

if nextCursor == "" {
    break                   // ← exits the loop here for single-page responses
}
cursor = nextCursor

// Safety limit to prevent infinite loops
if len(allServers) > 10000 { // ← never reached when nextCursor is empty
    return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}

The specific code path that triggers it

A registry API that ignores the limit=100 query parameter and returns all servers (say, 50,000 of them) in a single JSON response with an empty nextCursor will cause the loop to: (1) append all 50,000 entries to allServers, then (2) see nextCursor == "" and break — without ever checking the length.

Why existing code doesn't prevent it

The limit=100 request parameter is advisory; a non-compliant or compromised registry is free to ignore it. The 10-second HTTP client timeout does bound total transfer time, but json.NewDecoder(resp.Body).Decode() attempts to consume the entire body before returning — meaning a large single-page payload will be fully allocated in memory before any timeout can interrupt decoding at the application layer.

What the impact would be

A misconfigured or malicious registry served at MCPRegistry.Status.URL could force the operator to allocate unbounded memory in a single validation call, potentially causing OOM conditions in the operator pod. The comment says 'Safety limit to prevent infinite loops' but the error message says 'exceeded maximum server limit (10000)' — indicating the intent was a hard total-count cap, not just an infinite-loop guard.

How to fix it

Move the safety check before the break:

allServers = append(allServers, servers...)

if len(allServers) > 10000 {
    return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}

if nextCursor == "" {
    break
}
cursor = nextCursor

Step-by-step proof

  1. Registry API is called with limit=100 but returns 20,000 items and nextCursor: "".
  2. fetchRegistryPage decodes the full body and returns 20,000 ServerResponse objects plus an empty cursor.
  3. Back in listRegistryServers, allServers = append(allServers, servers...) grows to 20,000 entries — all in memory.
  4. if nextCursor == "" { break } is true, so the loop exits.
  5. The if len(allServers) > 10000 check is never evaluated.
  6. listRegistryServers returns all 20,000 entries, and they are passed to findImageInServers for scanning.

Mitigating factors that keep severity low: the registry URL is set by a cluster operator (trusted source), the limit=100 parameter discourages large responses from well-behaved servers, and the 10-second timeout bounds wall-clock exposure. Nevertheless, the placement is a logical error relative to the stated invariant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fix image_validation.go GetStorageName() for configYAML path

1 participant