NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation#4720
NOT TO BE MERGED! Query registry API instead of ConfigMap for image validation#4720ChrisJBurns wants to merge 1 commit intomainfrom
Conversation
The image validation logic used GetStorageName() to look up a
{name}-registry-storage ConfigMap that was only created by the legacy
typed config path. With the configYAML path, this ConfigMap does not
exist, causing enforceServers to silently degrade — the lookup returns
NotFound and checkImageInRegistry returns false.
Replace the ConfigMap-based image lookup with HTTP queries to the
registry API service at Status.URL (/v0.1/servers endpoint). This
checks OCI package identifiers in the server list, which works
correctly with both legacy and configYAML paths.
Fixes #4717
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode) | ||
| } | ||
|
|
||
| var listResp v0.ServerListResponse | ||
| if err := json.NewDecoder(resp.Body).Decode(&listResp); err != nil { | ||
| return nil, "", fmt.Errorf("failed to decode registry API response: %w", err) | ||
| } | ||
|
|
||
| // Check servers in groups | ||
| // TODO: check with Rado or Ria, is this needed? | ||
| for _, group := range reg.Groups { | ||
| for _, server := range group.Servers { | ||
| if server.Image == image { | ||
| return listResp.Servers, listResp.Metadata.NextCursor, nil |
There was a problem hiding this comment.
🟡 In fetchRegistryPage(), when resp.StatusCode != http.StatusOK, the function returns early without draining the response body. The deferred resp.Body.Close() closes it unread, preventing Go's HTTP transport from returning the TCP connection to the pool. Add io.Copy(io.Discard, resp.Body) before the early return to allow connection reuse on error paths.
Extended reasoning...
What the bug is and how it manifests
In fetchRegistryPage() (image_validation.go lines 287-295), when resp.StatusCode != http.StatusOK, the code returns an error immediately. A defer closes the body, but the body was never read. Per Go's net/http documentation, the HTTP/1.x transport can only reuse a keep-alive TCP connection if the response body is fully consumed before being closed. Without draining, the connection is closed rather than returned to the pool.
The specific code path that triggers it
The sequence in fetchRegistryPage is: (1) v.httpClient.Do(req) makes the HTTP call, (2) defer resp.Body.Close() is registered, (3) if resp.StatusCode != http.StatusOK returns an error immediately, (4) the deferred Close() runs on an unread body. On the success path the body is fully consumed by json.NewDecoder(resp.Body).Decode, so only the error path is affected.
Why existing code does not prevent it
The defer resp.Body.Close() pattern is correct for the success path but insufficient for error paths. Go's transport checks whether the body was fully read when Close() is called to decide if the connection can be returned to the pool. An unread body means the transport cannot safely reuse the connection and must close the TCP socket instead.
What the impact would be
Each non-200 response from the registry API wastes one TCP connection. Under normal operation this path is never hit. It only matters during registry errors (5xx, 4xx). For a cluster-internal operator admission webhook this is low-impact: non-200 responses should be rare, the connection pool is bounded, and the 10-second client timeout provides a natural backstop. All four verifiers agreed this is a nit-severity issue.
How to fix it
Add io.Copy(io.Discard, resp.Body) before the early return on non-200 status:
if resp.StatusCode \!= http.StatusOK {
_, _ = io.Copy(io.Discard, resp.Body)
return nil, "", fmt.Errorf("registry API returned status %d", resp.StatusCode)
}
Step-by-step proof
- fetchRegistryPage calls v.httpClient.Do(req) and receives a 503 response from a struggling registry.
- defer resp.Body.Close() is registered but not yet run.
- resp.StatusCode != http.StatusOK is true; the function returns an error immediately.
- The deferred Close() runs on an unread body (the 503 HTML error page was never consumed).
- Go's http.Transport sees an unread body on Close() and marks the connection as non-reusable, closing the underlying TCP socket.
- The next pagination call or the next admission webhook invocation must open a fresh TCP connection to the registry service rather than reusing the pooled one.
| return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) { | ||
| resp := v0.ServerListResponse{ | ||
| Servers: servers, | ||
| Metadata: v0.Metadata{Count: len(servers)}, | ||
| } | ||
| w.Header().Set("Content-Type", "application/json") | ||
| err := json.NewEncoder(w).Encode(resp) | ||
| require.NoError(t, err) | ||
| })) | ||
| } | ||
|
|
There was a problem hiding this comment.
🟡 In newRegistryAPIServer() (image_validation_test.go:128), require.NoError(t, err) is called inside the HTTP handler which runs in a goroutine managed by httptest.Server, not the test goroutine. Per Go testing docs, t.FailNow() (called by require on failure) must only be invoked from the test goroutine; calling it elsewhere invokes runtime.Goexit() on the handler goroutine, silently terminating it without failing the test. Replace with assert.NoError(t, err), which calls t.Fail() and is safe from any goroutine.
Extended reasoning...
What the bug is and how it manifests
The newRegistryAPIServer() helper function creates an httptest.Server with an HTTP handler that calls require.NoError(t, err) after encoding the JSON response. The require package's NoError function calls t.FailNow() if the error is non-nil. Go's testing documentation explicitly states: "FailNow must be called from the goroutine running the Test or Benchmark function, not from other goroutines created during the test."
The specific code path that triggers it
The HTTP handler runs in a separate goroutine managed by httptest.Server's internal HTTP server. This is a different goroutine from the one running the test function. At line 128:
err := json.NewEncoder(w).Encode(resp)
require.NoError(t, err) // violates Go testing contractIf Encode fails, require.NoError calls t.FailNow() → runtime.Goexit() on the handler goroutine.
Why existing code doesn't prevent it
The compiler and runtime do not enforce that t.FailNow() is only called from the test goroutine. The handler compiles and runs fine. The goroutine constraint is a semantic requirement documented in the testing package but not mechanically enforced.
What the impact would be
If json.NewEncoder(w).Encode(resp) were to fail, runtime.Goexit() would terminate only the HTTP handler goroutine, not the test goroutine. The test continues running without being marked as failed. The client in the test goroutine would receive a partial or truncated response, potentially causing a JSON decode error or other confusing failure rather than a clear test failure at the encoding step. The root cause (encoding failure) would be invisible in test output.
How to fix it
Replace require.NoError(t, err) with assert.NoError(t, err). The assert package calls t.Fail() (which marks the test failed without stopping the goroutine) rather than t.FailNow(), and t.Fail() is safe to call from any goroutine per the Go testing contract.
Step-by-step proof
- Test calls
newRegistryAPIServer(t, servers)which starts anhttptest.Server - Later, the test's HTTP client issues GET /v0.1/servers
httptest.Serverdispatches the request in a new goroutine (goroutine B), while the test runs in goroutine A- Inside goroutine B,
json.NewEncoder(w).Encode(resp)fails (hypothetically) require.NoError(t, err)→t.FailNow()→runtime.Goexit()on goroutine B- Goroutine B exits without writing a complete response; the test's
t.Failed()remainsfalse - Goroutine A receives a broken HTTP response and may fail for the wrong reason, or pass if the test doesn't validate the response carefully
- The actual encoding error is never surfaced as a direct test failure
| cursor := "" | ||
|
|
||
| for { | ||
| servers, nextCursor, err := v.fetchRegistryPage(ctx, registryURL, cursor) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| allServers = append(allServers, servers...) | ||
|
|
||
| if nextCursor == "" { | ||
| break | ||
| } | ||
| cursor = nextCursor | ||
|
|
||
| // Safety limit to prevent infinite loops | ||
| if len(allServers) > 10000 { | ||
| return nil, fmt.Errorf("exceeded maximum server limit (10000)") |
There was a problem hiding this comment.
🟡 The safety limit check in listRegistryServers() is unreachable for single-page responses: if len(allServers) > 10000 is placed after the if nextCursor == "" { break } statement, so a registry returning >10,000 items in a single page (ignoring the limit=100 request parameter) would bypass the cap entirely. The fix is to move the safety check before the break.
Extended reasoning...
What the bug is and how it manifests
In listRegistryServers() (image_validation.go, lines 233–250), after appending a page of results the code checks whether there is a next cursor and breaks immediately if there is none. The 10,000-server safety check lives below that break, so it is never evaluated for the final (or only) page of a response:
allServers = append(allServers, servers...)
if nextCursor == "" {
break // ← exits the loop here for single-page responses
}
cursor = nextCursor
// Safety limit to prevent infinite loops
if len(allServers) > 10000 { // ← never reached when nextCursor is empty
return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}The specific code path that triggers it
A registry API that ignores the limit=100 query parameter and returns all servers (say, 50,000 of them) in a single JSON response with an empty nextCursor will cause the loop to: (1) append all 50,000 entries to allServers, then (2) see nextCursor == "" and break — without ever checking the length.
Why existing code doesn't prevent it
The limit=100 request parameter is advisory; a non-compliant or compromised registry is free to ignore it. The 10-second HTTP client timeout does bound total transfer time, but json.NewDecoder(resp.Body).Decode() attempts to consume the entire body before returning — meaning a large single-page payload will be fully allocated in memory before any timeout can interrupt decoding at the application layer.
What the impact would be
A misconfigured or malicious registry served at MCPRegistry.Status.URL could force the operator to allocate unbounded memory in a single validation call, potentially causing OOM conditions in the operator pod. The comment says 'Safety limit to prevent infinite loops' but the error message says 'exceeded maximum server limit (10000)' — indicating the intent was a hard total-count cap, not just an infinite-loop guard.
How to fix it
Move the safety check before the break:
allServers = append(allServers, servers...)
if len(allServers) > 10000 {
return nil, fmt.Errorf("exceeded maximum server limit (10000)")
}
if nextCursor == "" {
break
}
cursor = nextCursorStep-by-step proof
- Registry API is called with
limit=100but returns 20,000 items andnextCursor: "". fetchRegistryPagedecodes the full body and returns 20,000ServerResponseobjects plus an empty cursor.- Back in
listRegistryServers,allServers = append(allServers, servers...)grows to 20,000 entries — all in memory. if nextCursor == "" { break }is true, so the loop exits.- The
if len(allServers) > 10000check is never evaluated. listRegistryServersreturns all 20,000 entries, and they are passed tofindImageInServersfor scanning.
Mitigating factors that keep severity low: the registry URL is set by a cluster operator (trusted source), the limit=100 parameter discourages large responses from well-behaved servers, and the 10-second timeout bounds wall-clock exposure. Nevertheless, the placement is a logical error relative to the stated invariant.
Summary --- NOT TO BE MERGED!
The image validation logic (
enforceServers) silently degrades with the newconfigYAMLpath becausecheckImageInRegistryreads from a{name}-registry-storageConfigMap that only exists in the legacy typed config path. With the legacy path being removed (#4705),enforceServers: truehas no effect — the ConfigMap lookup returns NotFound and the function returns false.This PR replaces the ConfigMap-based image lookup with HTTP queries to the registry API service at
Status.URL(/v0.1/serversendpoint), checking OCI package identifiers for image presence. This works correctly with both legacy and configYAML paths.Fixes #4717
Medium level
checkImageInRegistrywith HTTP GET to the registry API's/v0.1/serversendpoint using the URL fromMCPRegistry.Status.URLlistRegistryServers/fetchRegistryPage) with a 10,000 server safety limitImageMetadata.Image(toolhive-core registry types) to checking OCI packages (registryType: "oci",identifierfield) in the MCP Registry API responsehttpClientfield toRegistryEnforcingValidatorwith a 10-second timeouthttptest.NewServer()serving registry API responsesLow level
cmd/thv-operator/pkg/validation/image_validation.gonet/http,net/url, andv0API types. AddhttpClientfield. RewritecheckImageInRegistryto query registry API. AddlistRegistryServers,fetchRegistryPage, andfindImageInServers. RemovefindImageInRegistryand ConfigMap-related code.cmd/thv-operator/pkg/validation/image_validation_test.gohttptest.NewServer. Add test helpers (newRegistryAPIServer,makeOCIServerResponse). Update all test cases to useStatus.URLand registry API responses. Add test cases for empty URL, API errors, non-OCI packages, and multiple packages.Type of change
Test plan
task test) — all 27 validation tests passtask lint-fix) — 0 issuesgo test ./cmd/thv-operator/...) — all unit tests pass; only pre-existing integration test failures (missing kubebuilder binaries)Does this introduce a user-facing change?
Yes —
enforceServers: trueon MCPRegistry resources using theconfigYAMLpath will now correctly enforce image validation. Previously it silently did nothing.Special notes for reviewers
GetStorageName()method andgetConfigMapName()function inregistryapi/manager.goare now unused by the validation package but retained since they may still be used by other code paths. They can be cleaned up separately if the legacy path is fully removed.registryType == "oci"on server packages rather than the legacyImageMetadata.Imagefield. This aligns with the MCP Registry API v0.1 specification.Generated with Claude Code