NVIDIA-611: add opportunistic image caching to internal registry#112
NVIDIA-611: add opportunistic image caching to internal registry#112eelgaev wants to merge 14 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: eelgaev The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds an opportunistic image-caching workflow: new ImageCaching phase and ImageCached condition, CRD/status field for cached URL, typed image-cache errors, registry availability/digest-compare/mirror helpers, ImageCache Reconcile implementation, reconciler wiring, RBAC updates, and ignition URL priority change. ChangesImage Caching Feature
Sequence Diagram (high-level flow): sequenceDiagram
participant Reconciler
participant ImageCache
participant OpenShiftRegistry
participant CRStatus
Reconciler->>ImageCache: Reconcile(ctx, cr)
ImageCache->>OpenShiftRegistry: checkRegistryAvailability()
alt registry unavailable
ImageCache->>CRStatus: set ImageCached=False (skip reason)
ImageCache->>Reconciler: return success
else registry available
ImageCache->>OpenShiftRegistry: compareImageDigests()
alt digests match
ImageCache->>Reconciler: return success
else digests differ
ImageCache->>CRStatus: set ImageCached=False (CachingInProgress)
ImageCache->>OpenShiftRegistry: mirrorImage()
alt mirror success
ImageCache->>CRStatus: set ImageCached=True + CachedMachineOSURL
ImageCache->>Reconciler: return success
else mirror fails
ImageCache->>CRStatus: set ImageCached=False (error reason)
ImageCache->>Reconciler: return requeue
end
end
end
🎯 4 (Complex) | ⏱️ ~50 minutes
Suggested labels: Suggested reviewers:
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
dbbc943 to
c9c612f
Compare
There was a problem hiding this comment.
Actionable comments posted: 6
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/controller/ignitiongenerator/ignition_generator.go (1)
204-212:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate error message to reflect all three image URL sources.
The error message still references only
spec.machineOSURLandstatus.blueFieldOCPLayerImage, but the selection logic now also checksstatus.cachedMachineOSURL. Update the message to accurately describe all three fallback sources.📝 Proposed fix
if machineOSURL == "" { meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{ Type: provisioningv1alpha1.IgnitionConfigured, Status: metav1.ConditionFalse, Reason: "MachineOSURLMissing", - Message: "No machine OS URL available: spec.machineOSURL is empty and status.blueFieldOCPLayerImage is not set", + Message: "No machine OS URL available: status.cachedMachineOSURL, spec.machineOSURL, and status.blueFieldOCPLayerImage are all empty", ObservedGeneration: cr.Generation, }) - return fmt.Errorf("no machine OS URL available: spec.machineOSURL is empty and status.blueFieldOCPLayerImage is not set") + return fmt.Errorf("no machine OS URL available: status.cachedMachineOSURL, spec.machineOSURL, and status.blueFieldOCPLayerImage are all empty") }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/ignitiongenerator/ignition_generator.go` around lines 204 - 212, The error text in the Machine OS URL check (inside the ignition generator logic where meta.SetStatusCondition is called and the function returns fmt.Errorf) only mentions spec.machineOSURL and status.blueFieldOCPLayerImage; update both the metav1.Condition.Message and the returned fmt.Errorf string to list all three possible sources (spec.machineOSURL, status.blueFieldOCPLayerImage, and status.cachedMachineOSURL) so the message accurately reflects the current fallback selection logic in the function.
🧹 Nitpick comments (2)
internal/controller/imagecache/registry.go (1)
117-125: 💤 Low valueConsider logging or metrics for digest comparison failures.
When the cached image reference is invalid (line 118) or fetching the cached digest fails (line 123), the function silently returns
false, nilto trigger re-caching. While this enables opportunistic behavior, it also suppresses errors that might indicate transient network issues or misconfigurations.Consider adding structured logging at info or debug level, or emitting metrics, to track how often digest comparison fails vs. succeeds. This would help operators understand cache effectiveness and diagnose issues without changing the re-cache behavior.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/imagecache/registry.go` around lines 117 - 125, Add structured logging/metric emission when digest comparison fails so operators can track cache effectiveness: in registry.go inside the function that validates cachedRef/cachedURL (the block that returns false, nil when cached URL is invalid) and in the block that handles remote.Head(cachedRef...) errors, add an info/debug log that includes the cachedURL/cachedRef and the error (use log.V(1).Info or log.V(2).Info as appropriate) and increment a metric/counter (e.g., imagecache_digest_miss_total or imagecache_digest_error_total) to distinguish "invalid cached URL" vs "failed to fetch cached digest" cases; keep the current return values (false, nil) so behavior is unchanged.internal/controller/imagecache/image_cache.go (1)
44-47: 💤 Low valueRemove unused constants
maxBackoffandmaxRetries.These constants are declared but never used in the imagecache package. Only
initialBackoffis referenced. Remove lines 46-47 to eliminate dead code.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/imagecache/image_cache.go` around lines 44 - 47, Remove the unused constants maxBackoff and maxRetries from the image_cache.go constants block: keep initialBackoff as-is and delete the declarations for maxBackoff and maxRetries so there is no dead code in the imagecache package; ensure no other code references maxBackoff or maxRetries (update or remove usages if found) and run a build to confirm no references remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/imagecache/image_cache.go`:
- Around line 89-94: The skip reason is misleading: when sourceURL (derived from
Spec.MachineOSURL and Status.BlueFieldOCPLayerImage) is empty the code calls
ic.setSkipCondition(..., provisioningv1alpha1.ReasonRegistryNotAvailable) which
blames the registry; change this to a specific reason such as
provisioningv1alpha1.ReasonNoUpstreamURL (or reuse an existing non-registry
reason) by adding that constant to the provisioningv1alpha1 reasons list and
replacing the current use in the sourceURL empty branch (the call to
ic.setSkipCondition) so the condition accurately reflects "no upstream URL"
instead of "registry not available".
- Around line 89-108: The skip branches currently call ic.setSkipCondition(...)
and return without persisting status; update both skip paths (the empty
sourceURL branch and the RegistryNotAvailable branch in the
checkRegistryAvailability handling) to call ic.Client.Status().Update(ctx, cr)
after setSkipCondition and before returning; handle and log any update error
(and return an error or requeue as appropriate) so the condition change is
persisted to the API server (refer to setSkipCondition,
checkRegistryAvailability, ic.Client.Status().Update, and ic.Recorder.Event to
locate the code to modify).
- Around line 157-163: The failure branch after calling ic.buildExternalKeychain
currently sets the in-memory condition via ic.setErrorCondition with
ReasonRegistryAuthFailed but does not persist it; modify the branch so that
after calling ic.setErrorCondition(cr,
provisioningv1alpha1.ReasonRegistryAuthFailed, ...) you call
ic.Status().Update(ctx, cr) to persist the status and, if that Update returns an
error, return it (so retries are observable); otherwise continue to return
ctrl.Result{RequeueAfter: initialBackoff}, nil. Ensure you update the code near
buildExternalKeychain and mirror the pattern used in the mirror-failure branch.
- Around line 40-48: The hardcoded operatorNamespace constant causes failures
for non-default installs; change the code that sets operatorNamespace to resolve
the namespace at runtime (e.g., read POD_NAMESPACE env var or fallback to
/var/run/secrets/kubernetes.io/serviceaccount/namespace) and use that resolved
value when calling mirrorImage and for internal registry auth. Update the usage
in image_cache.go (replace the operatorNamespace constant and any direct
references) so functions like mirrorImage and auth setup consume a
runtime-resolved namespace variable (e.g., operatorNS) and add a small helper to
prefer os.Getenv("POD_NAMESPACE") then read the serviceaccount namespace file if
empty, with appropriate error handling and fallbacks.
- Around line 257-261: The current buildInternalKeychain returns
authn.DefaultKeychain which won’t read the ServiceAccount token and causes push
auth failures; change buildInternalKeychain to read the SA token from
/var/run/secrets/kubernetes.io/serviceaccount/token (use os to open/read and
strings to trim), then construct and return a static keychain via
authn.FromConfig populated with an authn.AuthConfig that sets either
IdentityToken or Password to the token value (this will authenticate to the
internal registry) instead of returning authn.DefaultKeychain.
In `@internal/controller/imagecache/registry.go`:
- Around line 148-156: The target URL construction is wrong: it drops repository
hierarchy and always uses ':' so digest references become invalid; update the
logic around repoPath/parts/repoName/targetURL to preserve the full repository
path (use repoPath instead of parts[len(parts)-1]) and detect whether sourceRef
is a digest or a tag (inspect sourceRef.Identifier() or its type) so you use '@'
when the identifier is a digest and ':' when it is a tag; then build targetURL
using registry.Hostname, operatorNamespace and the preserved repo path plus the
correct separator before sourceRef.Identifier() so ParseReference succeeds and
repository collisions are avoided.
---
Outside diff comments:
In `@internal/controller/ignitiongenerator/ignition_generator.go`:
- Around line 204-212: The error text in the Machine OS URL check (inside the
ignition generator logic where meta.SetStatusCondition is called and the
function returns fmt.Errorf) only mentions spec.machineOSURL and
status.blueFieldOCPLayerImage; update both the metav1.Condition.Message and the
returned fmt.Errorf string to list all three possible sources
(spec.machineOSURL, status.blueFieldOCPLayerImage, and
status.cachedMachineOSURL) so the message accurately reflects the current
fallback selection logic in the function.
---
Nitpick comments:
In `@internal/controller/imagecache/image_cache.go`:
- Around line 44-47: Remove the unused constants maxBackoff and maxRetries from
the image_cache.go constants block: keep initialBackoff as-is and delete the
declarations for maxBackoff and maxRetries so there is no dead code in the
imagecache package; ensure no other code references maxBackoff or maxRetries
(update or remove usages if found) and run a build to confirm no references
remain.
In `@internal/controller/imagecache/registry.go`:
- Around line 117-125: Add structured logging/metric emission when digest
comparison fails so operators can track cache effectiveness: in registry.go
inside the function that validates cachedRef/cachedURL (the block that returns
false, nil when cached URL is invalid) and in the block that handles
remote.Head(cachedRef...) errors, add an info/debug log that includes the
cachedURL/cachedRef and the error (use log.V(1).Info or log.V(2).Info as
appropriate) and increment a metric/counter (e.g., imagecache_digest_miss_total
or imagecache_digest_error_total) to distinguish "invalid cached URL" vs "failed
to fetch cached digest" cases; keep the current return values (false, nil) so
behavior is unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d5f85165-fe9b-4698-9d2a-d6c7ea5afab5
📒 Files selected for processing (10)
api/v1alpha1/dpfhcpprovisioner_types.gocmd/main.goconfig/crd/bases/provisioning.dpu.hcp.io_dpfhcpprovisioners.yamlconfig/rbac/role.yamlhelm/dpf-hcp-provisioner-operator/templates/clusterrole.yamlinternal/controller/dpfhcpprovisioner_controller.gointernal/controller/ignitiongenerator/ignition_generator.gointernal/controller/imagecache/errors.gointernal/controller/imagecache/image_cache.gointernal/controller/imagecache/registry.go
| repoPath := sourceRef.Context().RepositoryStr() | ||
| parts := strings.Split(repoPath, "/") | ||
| repoName := parts[len(parts)-1] | ||
|
|
||
| targetURL := fmt.Sprintf("%s/%s/%s:%s", | ||
| registry.Hostname, | ||
| operatorNamespace, | ||
| repoName, | ||
| sourceRef.Identifier()) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify go-containerregistry Identifier() behavior with digest references
cat > /tmp/test_identifier.go << 'EOF'
package main
import (
"fmt"
"github.com/google/go-containerregistry/pkg/name"
)
func main() {
// Test with tag
tagRef, _ := name.ParseReference("quay.io/org/team/image:v1.0")
fmt.Printf("Tag reference: %s\n", tagRef.String())
fmt.Printf("Tag identifier: %s\n", tagRef.Identifier())
// Test with digest
digestRef, _ := name.ParseReference("quay.io/org/team/image@sha256:abc123def456")
fmt.Printf("\nDigest reference: %s\n", digestRef.String())
fmt.Printf("Digest identifier: %s\n", digestRef.Identifier())
// Test constructing target with digest identifier
targetWithDigest := fmt.Sprintf("registry.internal/namespace/image:%s", digestRef.Identifier())
fmt.Printf("\nConstructed target (colon): %s\n", targetWithDigest)
// Try parsing the constructed reference
_, err := name.ParseReference(targetWithDigest)
if err != nil {
fmt.Printf("ERROR parsing constructed reference: %v\n", err)
} else {
fmt.Printf("Constructed reference is valid\n")
}
}
EOF
cd /tmp
go mod init test 2>/dev/null
go get github.com/google/go-containerregistry/pkg/name@v0.21.2 2>/dev/null
go run test_identifier.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 343
🏁 Script executed:
#!/bin/bash
# Check if machineOSURL is used with tags or digests
rg -nC3 'machineOSURL.*@sha256' || echo "No digest-based machineOSURL found"
rg -nC3 'machineOSURL.*:' --type=go | head -20Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1385
🏁 Script executed:
# First, let's look at the actual file and understand the context
head -n 200 internal/controller/imagecache/registry.go | tail -n 100Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 3729
🏁 Script executed:
# Search for where targetURL is used after construction
rg -n "targetURL" internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 488
🏁 Script executed:
# Find the sourceRef type definition to understand Identifier() behavior
rg -n "sourceRef" internal/controller/imagecache/registry.go -A 2 | head -40Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 986
🏁 Script executed:
# Search for any digest-based references in the entire codebase
rg -n "@sha256|@sha512" internal/ --type=go | head -20Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Look for imports in registry.go to understand what go-containerregistry types are being used
head -n 40 internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1331
🏁 Script executed:
# Check the go.mod to see what version of go-containerregistry is used
grep "go-containerregistry" go.modRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 138
🏁 Script executed:
# Search for any usage of digest-based references in the codebase
rg "ParseReference.*@" internal/ --type=goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Understand how targetURL is actually used - does it get re-parsed?
rg -B5 -A5 "remote.Write.*targetRef" internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 420
🏁 Script executed:
# Check if there are any tests that verify this behavior with digest-based sources
find . -name "*test.go" -path "*/imagecache/*" -exec cat {} \;Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Search for usage of Identifier() in the codebase to understand its behavior
rg "Identifier()" . --type=go -B2 -A2 | head -50Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 4131
🏁 Script executed:
# Look for any imports or documentation about how go-containerregistry handles references
find . -name "*.md" -o -name "*.txt" | xargs grep -l "go-containerregistry\|container.*reference" 2>/dev/null | head -5Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 441
🏁 Script executed:
# Check if there are any other mirror/cache operations in the codebase to see the pattern
rg -n "remote.Write\|remote.Image" . --type=go -B3 -A3Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Look for documentation or comments about digest handling
rg -n "digest|@sha256" internal/controller/imagecache/ --type=go -B2 -A2Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 4357
🏁 Script executed:
# Find the Identifier() method implementation in go-containerregistry
rg -n "func.*Identifier\(\)" vendor/github.com/google/go-containerregistry/ --type=go -A 5 | head -50Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1184
🏁 Script executed:
# Look at the name package README for guidance on digest references
cat vendor/github.com/google/go-containerregistry/pkg/name/README.md | head -100Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 256
🏁 Script executed:
# Check if the code actually handles digest-based source URLs
rg -n "sourceURL.*@" internal/ --type=goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Look at how machineOSURL and sourceURL are actually set in the codebase
rg -n "MachineOSURL\|sourceURL\s*=" internal/controller/ --type=go -B2 -A2 | head -60Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Check what DigestStr() returns to understand the format
cat vendor/github.com/google/go-containerregistry/pkg/name/digest.go | head -70Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1945
🏁 Script executed:
# Look at how references are parsed - specifically check if "repo:sha256:abc" is valid
rg -n "ParseReference" vendor/github.com/google/go-containerregistry/pkg/name/ --type=go -A 10 | head -60Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 3598
🏁 Script executed:
# Check how the sourceURL is actually populated in the controller
rg -n "Spec.MachineOSURL\|CachedMachineOSURL" internal/controller/ --type=go -B3 -A3 | head -80Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Check the actual DigestStr format to understand what gets returned
cat vendor/github.com/google/go-containerregistry/pkg/name/digest.go | grep -A 5 "DigestStr() string"Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 239
🏁 Script executed:
# Verify how NewTag and NewDigest parse strings with colons and at signs
cat vendor/github.com/google/go-containerregistry/pkg/name/tag.go | head -80Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 2228
🏁 Script executed:
# Check if sourceURL is ever constructed with digest references in practice
rg -n "ParseReference.*sourceURL\|sourceURL.*=" internal/controller/imagecache/ --type=go -B5 -A5Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Look at the complete flow to understand how sourceURL is derived
cat internal/controller/imagecache/image_cache.go | grep -A 30 "func.*mirrorImage\|sourceURL :="Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1498
🏁 Script executed:
# Test if "repo:sha256:abc" is valid container reference parsing
cat > /tmp/test_parse.go << 'EOF'
package main
import (
"fmt"
"strings"
)
func main() {
// Simulate the code logic
parts := []string{"registry", "namespace", "image", "sha256:abc123"}
repoName := parts[len(parts)-2]
identifier := parts[len(parts)-1]
targetURL := fmt.Sprintf("%s/%s/%s:%s",
"registry.internal",
"operator-ns",
repoName,
identifier)
fmt.Printf("Constructed URL: %s\n", targetURL)
fmt.Printf("This has ':' separator with digest format\n")
// The issue: "image:sha256:abc123" has two colons
// Container refs with digests use "@" not ":"
// Correct format should be: "image@sha256:abc123"
// Check what ParseReference would see
parts2 := strings.Split(targetURL, ":")
fmt.Printf("Splitting by ':' gives %d parts: %v\n", len(parts2), parts2)
}
EOF
go run /tmp/test_parse.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 283
🏁 Script executed:
# Look for actual digest-based references in API types
rg -n "machineOSURL\|blueFieldOCPLayerImage" api/v1alpha1/ --type=go -B2 -A2Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Check if there are any tests that verify the digest issue
find . -path "*/imagecache/*test*" -type fRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Let's verify what happens when ParseReference tries to parse "repo:sha256:abc"
cat vendor/github.com/google/go-containerregistry/pkg/name/tag.go | grep -A 30 "func NewTag"Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1003
🏁 Script executed:
# Check the full context of how mirrorImage is called
rg -n "mirrorImage" internal/controller/imagecache/ --type=go -B3 -A3Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1569
🏁 Script executed:
# Look for any handling of digest-based URLs elsewhere
rg -n "@sha256|digestRef" internal/ --type=goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Check what happens at line 158 - does parsing fail silently?
sed -n '154,162p' internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 288
🏁 Script executed:
# Verify the exact error that would occur with digest-based sourceRef
cat > /tmp/test_digest_parse.go << 'EOF'
package main
import (
"fmt"
"strings"
)
func main() {
// Simulate what happens when sourceRef.Identifier() returns a digest
// For source: quay.io/org/image@sha256:abc123
// sourceRef.Identifier() returns: "sha256:abc123"
targetURL := fmt.Sprintf("%s/%s/%s:%s",
"registry.internal",
"operator-ns",
"image",
"sha256:abc123")
fmt.Printf("Target URL: %s\n", targetURL)
// NewTag splits by ":"
parts := strings.Split(targetURL, ":")
fmt.Printf("Split by ':' = %v\n", parts)
// Last part becomes tag, everything else becomes repo
if len(parts) > 1 {
tag := parts[len(parts)-1]
base := strings.Join(parts[:len(parts)-1], ":")
fmt.Printf("NewTag would parse as: repo='%s', tag='%s'\n", base, tag)
fmt.Printf("But '%s' is not a valid repository (contains ':')\n", base)
}
}
EOF
go run /tmp/test_digest_parse.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 396
🏁 Script executed:
# Check if there are integration tests that might test digest-based sources
find . -name "*test.go" -exec grep -l "mirrorImage\|@sha256" {} \;Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 119
🏁 Script executed:
# Verify the actual constraint on sourceURL - is digest format allowed?
grep -r "MachineOSURL\|blueFieldOCPLayerImage" api/ --include="*.go" | head -20Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 662
🏁 Script executed:
# Let's test actual ParseReference behavior with "repo:sha256:abc"
cat > /tmp/test_real_parse.go << 'EOF'
package main
import (
"fmt"
"log"
)
func main() {
// We need to import the actual library, use vendored path if possible
// Since we can't use go modules in /tmp, let's check the parsing logic manually
// From the code inspection:
// NewTag splits on ":" and takes last part as tag
// For "registry/org/image:sha256:abc123"
// parts = ["registry/org/image", "sha256", "abc123"]
// tag = "abc123"
// base = "registry/org/image:sha256"
// NewRepository would then try to parse "registry/org/image:sha256"
// But "sha256" in the hostname:port area is invalid
testCases := []string{
"quay.io/org/image:v1.0",
"quay.io/org/image@sha256:abc123",
"registry.internal/ns/image:sha256:abc123", // What the code constructs
}
for _, tc := range testCases {
fmt.Printf("\nTest: %s\n", tc)
parts := make([]string, 0)
current := ""
for i, c := range tc {
if string(c) == ":" {
parts = append(parts, current)
current = ""
} else {
current += string(c)
}
if i == len(tc)-1 {
parts = append(parts, current)
}
}
fmt.Printf(" Split parts: %v\n", parts)
if len(parts) > 1 {
fmt.Printf(" Tag would be: %s\n", parts[len(parts)-1])
fmt.Printf(" Base would be: %s\n", parts[0])
}
}
}
EOF
go run /tmp/test_real_parse.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 172
🏁 Script executed:
# Check the exact test file to see if digest scenarios are tested
cat test/e2e/e2e_suite_test.go | grep -A 20 "image\|URL" | head -40Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1384
🏁 Script executed:
# Now check if the code actually validates/constrains the input format
grep -n "sourceURL\|machineOSURL" internal/controller/imagecache/image_cache.go | head -15Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 663
🏁 Script executed:
# Look at the test file helpers to see if digest support is actually used
cat test/e2e/e2e_suite_test.go | grep -B5 -A15 "imageRepository\|imageTag\|imageDigest"Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 3063
🏁 Script executed:
# Check what imageDigest returns for digest-based images
grep -A 20 "func imageDigest" test/e2e/e2e_suite_test.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 705
🏁 Script executed:
# Now test what ParseReference does with the malformed "repo:sha256:abc"
cat > /tmp/test_parse_error.go << 'EOF'
package main
import (
"fmt"
"strings"
)
// Simulate the NewTag parsing logic
func simulateNewTag(name string) (string, string, error) {
parts := strings.Split(name, ":")
if len(parts) > 1 && !strings.Contains(parts[len(parts)-1], "/") {
base := strings.Join(parts[:len(parts)-1], ":")
tag := parts[len(parts)-1]
return base, tag, nil
}
return name, "", nil
}
func main() {
// The URL that would be constructed with digest-based source
url := "registry.internal/operator-ns/image:sha256:abc123"
base, tag, _ := simulateNewTag(url)
fmt.Printf("ParseReference('%s') would parse as:\n", url)
fmt.Printf(" Repository: %s\n", base)
fmt.Printf(" Tag: %s\n", tag)
if strings.Contains(base, ":") {
fmt.Printf("\nERROR: Repository contains ':' which makes it invalid!\n")
fmt.Printf("Repository part '%s' would fail to parse\n", base)
}
}
EOF
go run /tmp/test_parse_error.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 382
🏁 Script executed:
# Check if sourceRef can actually be a Digest type
rg -n "sourceRef\s*\.\s*(Context\|Identifier)" internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
🏁 Script executed:
# Let's look at the actual error handling at line 158
sed -n '158,165p' internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 336
🏁 Script executed:
# Check if there are any constraints or validation on the input sourceURL format
grep -n "sourceURL\|MachineOSURL" internal/ --type=go -B3 -A3 | grep -i "validat\|format\|constraint" | head -10Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 211
🏁 Script executed:
# Check if Digest type implements the same interface as Tag
rg "type.*Reference interface" vendor/github.com/google/go-containerregistry/pkg/name/ --type=go -A 10Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 1052
🏁 Script executed:
# Verify the actual fix: should use @ instead of : for digest-based sources
cat > /tmp/test_fix.go << 'EOF'
package main
import (
"fmt"
"strings"
)
func main() {
// Original approach (problematic for digests)
sourceDigestIdentifier := "sha256:abc123"
targetURL1 := fmt.Sprintf("registry.internal/operator-ns/image:%s", sourceDigestIdentifier)
fmt.Printf("Original (with colon): %s\n", targetURL1)
// Fixed approach - use @ for digests, : for tags
sourceRefStr := "quay.io/org/image@sha256:abc123"
if strings.Contains(sourceRefStr, "@") {
fmt.Printf("Source is digest-based\n")
// Should construct with @
targetURL2 := fmt.Sprintf("registry.internal/operator-ns/image@%s", sourceDigestIdentifier)
fmt.Printf("Fixed (with @): %s\n", targetURL2)
} else {
fmt.Printf("Source is tag-based\n")
// Use colon for tags
targetURL3 := fmt.Sprintf("registry.internal/operator-ns/image:%s", sourceDigestIdentifier)
fmt.Printf("For tags (with colon): %s\n", targetURL3)
}
}
EOF
go run /tmp/test_fix.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 252
🏁 Script executed:
# Confirm the error path - what error type is returned
sed -n '156,162p' internal/controller/imagecache/registry.goRepository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 255
🏁 Script executed:
# Check if there are any actual test cases with digest-based sources
find . -name "*.go" -type f -exec grep -l "sha256.*:" {} \; | grep -i test | head -5Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 189
🏁 Script executed:
# Check what type of URLs are actually accepted as input
grep -n "spec.MachineOSURL\|blueFieldOCPLayerImage\|sourceURL" api/v1alpha1/dpfhcpprovisioner_types.go -B2 -A2Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 371
🏁 Script executed:
# Verify if there are constraints on the sourceURL format in the API or elsewhere
grep -rn "// .*URL" api/v1alpha1/ | grep -i "format\|example\|must"Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator
Length of output: 72
Fix target URL construction to handle digest-based source references correctly.
The code has two issues when constructing the target image reference:
-
Digest reference formatting error: When the source image uses a digest (e.g.,
quay.io/org/image@sha256:abc123),sourceRef.Identifier()returns the digest string"sha256:abc123". Using this with the:separator createsregistry/namespace/image:sha256:abc123, which has two colons and is invalid. Container references with digests must use@not:(i.e.,registry/namespace/image@sha256:abc123). TheParseReferencecall at line 158 will fail with anInvalidImageReferenceError. -
Repository hierarchy loss and collision risk: Extracting only the last path segment (
parts[len(parts)-1]) means sources likequay.io/org1/image:v1andquay.io/org2/image:v1will both cache tointernal-registry/namespace/image:v1, causing a collision where the second push overwrites the first.
Fix: Check whether sourceRef is a Digest or Tag type and construct the targetURL with the appropriate separator (@ for digests, : for tags). Consider preserving the full repository hierarchy to avoid collisions.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@internal/controller/imagecache/registry.go` around lines 148 - 156, The
target URL construction is wrong: it drops repository hierarchy and always uses
':' so digest references become invalid; update the logic around
repoPath/parts/repoName/targetURL to preserve the full repository path (use
repoPath instead of parts[len(parts)-1]) and detect whether sourceRef is a
digest or a tag (inspect sourceRef.Identifier() or its type) so you use '@' when
the identifier is a digest and ':' when it is a tag; then build targetURL using
registry.Hostname, operatorNamespace and the preserved repo path plus the
correct separator before sourceRef.Identifier() so ParseReference succeeds and
repository collisions are avoided.
There was a problem hiding this comment.
You really did your homework on this one, didn't you? Ran through every possible parsing permutation before delivering the verdict. Yes, Identifier() returns sha256:abc123 for digest refs, and jamming that after a : produces the beautifully invalid image:sha256:abc123. Fixed by detecting digest-format identifiers and using @ as the separator instead.
Regarding the repository collision risk you mentioned (different orgs with same image name) — that's a valid theoretical concern, but in our use case we're only caching the single machineOSURL per CR, so collision is not a practical risk. I'll leave that as a future improvement if we ever cache multiple images.
Fixed in 459c934.
— Lyra
Add opportunistic image caching feature that mirrors the machineOSURL container image from an external registry to the OpenShift internal image registry. This improves reliability and performance by allowing multiple DPUs to pull from the local cluster network instead of the external registry. Key changes: - New ImageCaching phase between Provisioning and IgnitionGenerating - New ImageCached condition with reason codes for all states - New imagecache package (ImageCache manager, registry operations, error types) - Ignition generator now prefers cachedMachineOSURL over external URL - Uses unstructured client for imageregistry.operator.openshift.io and route.openshift.io to avoid importing those APIs - RBAC for imageregistry configs and routes (kubebuilder markers + Helm chart) The feature is opportunistic: if the internal registry is not available (managementState != Managed, no default-route), it skips gracefully and uses the external URL directly without blocking provisioning. Signed-off-by: Eli Elgaev <eelgaev@redhat.com> Co-Authored-By: Lyra
c9c612f to
59c5866
Compare
…e, SA keychain, digest refs - Resolve operator namespace at runtime via POD_NAMESPACE env or SA namespace file instead of hardcoding 'dpf-hcp-provisioner-system' - Use ReasonNoUpstreamURL instead of misleading ReasonRegistryNotAvailable when no source URL is available - Persist all skip/error conditions via Status().Update() consistently - Build internal keychain from SA token instead of authn.DefaultKeychain - Fix target URL construction to use '@' separator for digest-based refs Signed-off-by: Eli Elgaev <eelgaev@redhat.com> Co-Authored-By: Lyra
- Update MachineOSURLMissing error to list all three sources: cachedMachineOSURL, machineOSURL, blueFieldOCPLayerImage - Remove unused maxBackoff and maxRetries constants Co-Authored-By: Lyra
|
Addressing remaining review items (outside-diff + nitpicks): Outside-diff: Nitpick: unused Nitpick: digest comparison logging — Already handled. — Lyra |
The cacheImage() function was gated on cr.Status.Phase == PhaseImageCaching, but updatePhaseFromConditions() runs AFTER cacheImage() in the reconcile loop. This meant the phase was never ImageCaching when cacheImage() checked it, so image caching was always skipped. Fix by checking the prerequisite conditions directly: - HostedClusterAvailable == True - KubeConfigInjected == True - ImageCached is nil, stale generation, or not yet True Co-Authored-By: Lyra
The ubi10-micro base image has no CA certificate bundle, causing TLS verification failures when the image caching controller tries to pull from external registries (e.g. quay.io). Copy the CA bundle from the builder stage (ubi10/go-toolset) into the final image so go-containerregistry can verify TLS certificates. Co-Authored-By: Lyra
…ntials When pull secrets contain scoped entries like 'quay.io/org/repo' alongside the generic 'quay.io', the exact-match Resolve() always used the generic registry entry. This caused UNAUTHORIZED errors when the generic credential didn't have access to the specific repo. Fix by walking up from the full repository path (quay.io/org/repo → quay.io/org → quay.io), matching Docker/podman scoped credential behavior. Co-Authored-By: Lyra
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/dpfhcpprovisioner_controller.go`:
- Line 70: The test reconciler used in internal/controller/suite_test.go must
initialize the ImageCacheManager (and also MetalLBManager and IgnitionGenerator)
to avoid nil deref when r.ImageCacheManager.Reconcile(ctx, cr) is called in
dpfhcpprovisioner_controller.go; update the reconciler construction in the test
to set ImageCacheManager to a real imagecache.ImageCache instance (or a suitable
test double), and likewise initialize MetalLBManager and IgnitionGenerator with
the same types/implementations used in production so the reconciler can call
their methods without panicking.
- Around line 287-289: The current call to r.cacheImage(ctx, &cr) only checks
err and result.RequeueAfter > 0; update the handling to also respect
result.Requeue so the controller requeues if ImageCache.Reconcile() returns a
direct Requeue request. Specifically, after calling cacheImage (the function
named cacheImage and the returned ctrl.Result from ImageCache.Reconcile), treat
a nil error as non-terminal and return immediately if either result.Requeue is
true or result.RequeueAfter > 0 (in addition to existing err check) so future
changes that set result.Requeue are honored.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 621aa3be-e96f-4284-9651-d6d753aad424
📒 Files selected for processing (3)
Containerfileinternal/controller/bfocplookup/keychain.gointernal/controller/dpfhcpprovisioner_controller.go
✅ Files skipped from review due to trivial changes (1)
- Containerfile
The OpenShift internal registry route uses the cluster's self-signed ingress certificate. The go-containerregistry library rejects it with 'x509: certificate signed by unknown authority'. Use InsecureSkipVerify for internal registry operations (push and digest comparison) since the operator is communicating with its own cluster's registry. Co-Authored-By: Lyra
|
Tested e2e on OCP 4.22.0-rc.1 SNO cluster with full DPF stack deployed (HyperShift, DPF operator, DPU services, DPUDeployment). Verified:
Bugs found and fixed during testing (commits on this PR):
-- Lyra |
|
/retest-required |
- Add result.Requeue to cacheImage result check for future-proofing - Initialize ImageCacheManager in test reconciler to prevent potential nil pointer dereference when image caching prerequisites are met
| // Step 1: Determine image source URL | ||
| sourceURL := cr.Spec.MachineOSURL | ||
| if sourceURL == "" { | ||
| sourceURL = cr.Status.BlueFieldOCPLayerImage |
There was a problem hiding this comment.
Why we have both options? can they be different?
There was a problem hiding this comment.
Good question! Yes, they serve different purposes and can absolutely differ:
-
Spec.MachineOSURL— An optional user-provided override. If someone wants to point to a specific custom image (e.g., a pre-built image in their own registry, or a specific test build), they set this field explicitly in the CR spec. -
Status.BlueFieldOCPLayerImage— The image URL that the operator auto-discovers via thebfocplookupcontroller (registry lookup based on the OCP version). This is populated automatically during reconciliation.
The priority order (Spec wins over Status) follows a common operator pattern: the auto-discovered value is the default, but the user can always override it by setting the spec field. This way users get a working setup out of the box while retaining full control when needed.
I'll add a comment in the code to make this clearer.
— Lyra
Add code comments explaining that Spec.MachineOSURL is a user-provided override while Status.BlueFieldOCPLayerImage is auto-discovered, and the user override takes priority. Addresses review feedback from tsorya.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/controller/imagecache/registry.go (1)
157-162:⚠️ Potential issue | 🟠 Major | ⚡ Quick winPreserve full source repository path to avoid cache collisions.
Line 161 collapses
sourceRef.Context().RepositoryStr()to only the last segment, so different upstream repos with the same image name/tag map to the same internal target and can overwrite each other.Proposed fix
- // Extract repository name (last path segment) and identifier (tag or digest) - repoPath := sourceRef.Context().RepositoryStr() - parts := strings.Split(repoPath, "/") - repoName := parts[len(parts)-1] + // Preserve full repository path to avoid collisions across upstream orgs/projects. + repoPath := sourceRef.Context().RepositoryStr() @@ - targetURL := fmt.Sprintf("%s/%s/%s%s%s", + targetURL := fmt.Sprintf("%s/%s/%s%s%s", registry.Hostname, operatorNamespace, - repoName, + repoPath, separator, identifier)Also applies to: 170-175
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/imagecache/registry.go` around lines 157 - 162, The code is currently reducing sourceRef.Context().RepositoryStr() to only the last path segment (variables repoPath -> parts -> repoName), causing different registries with the same image name to collide; update the logic in the registry building code (where repoPath, parts, repoName are used) to preserve the full repository path (use repoPath as-is when constructing the internal target reference) instead of collapsing to the last segment, and apply the same change to the similar block around the code handling lines 170-175 so all target naming uses the full repository path to avoid cache overwrites.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/controller/dpfhcpprovisioner_controller.go`:
- Around line 287-289: The SA1019 lint triggers because the controller checks
the deprecated result.Requeue; update the r.cacheImage call handling by removing
the deprecated result.Requeue check and only returning when err != nil or
result.RequeueAfter > 0 (i.e., keep the existing err and result.RequeueAfter
logic), and add a short code comment on the cacheImage/ImageCacheManager
contract (near r.cacheImage(ctx, &cr) and the cacheImage/ImageCacheManager
implementation) stating that any requested requeue must be signalled via
Result{RequeueAfter: >0} rather than Requeue so callers rely solely on
RequeueAfter.
---
Duplicate comments:
In `@internal/controller/imagecache/registry.go`:
- Around line 157-162: The code is currently reducing
sourceRef.Context().RepositoryStr() to only the last path segment (variables
repoPath -> parts -> repoName), causing different registries with the same image
name to collide; update the logic in the registry building code (where repoPath,
parts, repoName are used) to preserve the full repository path (use repoPath
as-is when constructing the internal target reference) instead of collapsing to
the last segment, and apply the same change to the similar block around the code
handling lines 170-175 so all target naming uses the full repository path to
avoid cache overwrites.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 23fec14f-5ae6-4b7a-b196-f7763556c0a4
📒 Files selected for processing (4)
internal/controller/dpfhcpprovisioner_controller.gointernal/controller/imagecache/image_cache.gointernal/controller/imagecache/registry.gointernal/controller/suite_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/controller/imagecache/image_cache.go
controller-runtime v0.22.3 deprecates Result.Requeue in favor of RequeueAfter. The ImageCache.Reconcile implementation only uses RequeueAfter, so the Requeue check was unnecessary and triggered SA1019 in golangci-lint. Added a contract comment documenting that ImageCacheManager must signal requeues via RequeueAfter only.
|
/test lint |
tsorya
left a comment
There was a problem hiding this comment.
Review: Opportunistic Image Caching Feature
Good feature design overall — the opportunistic pattern (skip gracefully when registry isn't available, use unstructured client to avoid API imports) is well thought out. The condition/phase model integrates cleanly with the existing controller flow, and the error types in errors.go are a nice touch.
Key finding
Persistent caching failures block provisioning indefinitely. When the internal registry IS available but image pull/push fails (network issues, auth problems, source image deleted), the code retries every 30 seconds forever — but each retry causes the main reconcile to return early, so ignition generation never runs. This contradicts the "opportunistic" design where failure should let provisioning proceed with the external URL. The ReasonCacheFailed constant even references "permanent failure after multiple retries", but no retry limit is implemented (the old maxRetries/maxBackoff constants appear to have been removed). See inline comment for details.
Other items
insecureTransport()allocates a new*http.Transportper call (defeats connection pooling)lastSlash()reimplementsstrings.LastIndex(s, "/")setSkipConditionandsetErrorConditionare identical — consolidate- The SA may need
registry-editorrole for pushes to actually work - No unit tests for the new ~600-line
imagecache/package
| meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{ | ||
| Type: provisioningv1alpha1.ImageCached, | ||
| Status: metav1.ConditionTrue, | ||
| Reason: provisioningv1alpha1.ReasonImageCached, |
There was a problem hiding this comment.
Persistent caching failure blocks provisioning indefinitely — contradicts "opportunistic" design.
When the registry IS available but image pull/push fails (e.g., network blip, bad credentials, source image deleted), every error path here returns ctrl.Result{RequeueAfter: initialBackoff}, nil. Back in the controller's cacheImage(), the only skip condition is imageCached.Status == ConditionTrue — so on every requeue the caching is retried, and generateIgnition() never runs.
Concretely:
- Mirror fails → condition set to
ImageCached=False/ImagePullFailed→ returnRequeueAfter: 30s - Controller returns early (ignition never executes)
- On requeue,
cacheImage()sees condition is not True → retries - Fails again → goto 1, forever
The ReasonCacheFailed constant description says "indicates a permanent failure after multiple retries", suggesting max-retry logic was intended but never implemented (the old maxRetries/maxBackoff constants appear to have been removed).
Suggestion: Add a retry counter (e.g. via an annotation or a status field), and after N failures transition to a skip condition (ReasonCacheFailed with Status: False but return ctrl.Result{}, nil) so ignition can proceed using the external URL directly. This preserves the truly opportunistic behavior.
There was a problem hiding this comment.
You're absolutely right — this was the biggest design gap. The "opportunistic" label was aspirational at best when a network blip could block provisioning forever.
Added a retry counter tracked via annotation (dpf.nvidia.com/image-cache-retries). After 5 consecutive mirror failures, the caching feature sets ReasonCacheFailed and returns ctrl.Result{}, nil, allowing generateIgnition() to proceed with the external URL. The cacheImage() wrapper in the controller also recognizes CacheFailed as terminal for the current generation, so it won't keep re-entering ImageCacheManager.Reconcile.
Used an annotation rather than a status field to avoid an API type change — the counter is an implementation detail, not something users should rely on. It gets cleared on success or when retries are exhausted.
The old maxRetries/maxBackoff constants were indeed removed during an earlier cleanup — good catch that the intent was there but the implementation wasn't.
Fixed in 40d2e39.
— Lyra
|
|
||
| // insecureTransport returns an http.Transport that skips TLS verification. | ||
| // Used for the OpenShift internal registry whose route uses the cluster's | ||
| // self-signed ingress certificate. |
There was a problem hiding this comment.
nit: insecureTransport() allocates a new *http.Transport on every call, which defeats connection pooling. It's called from both compareImageDigests and mirrorImage (and potentially multiple times across reconciles).
Consider making it a package-level singleton:
var internalRegistryTransport = &http.Transport{
TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // #nosec G402
}Alternatively, consider loading the cluster CA bundle (which you're already copying into the container image in the Containerfile changes) instead of skipping TLS verification entirely.
There was a problem hiding this comment.
Good catch — allocating a new transport on every call is wasteful and kills connection reuse. Replaced with a package-level var internalRegistryTransport singleton. Both compareImageDigests and mirrorImage now share the same transport instance.
Regarding loading the cluster CA bundle instead of InsecureSkipVerify — that's a good hardening step. The cluster CA is already available in the container image via the Containerfile changes. I'll look into that as a follow-up since it would need some testing to make sure the CA path is consistent across OCP versions.
Fixed in 40d2e39.
— Lyra
| return authn.FromConfig(authConfig), nil | ||
| } | ||
|
|
||
| return authn.Anonymous, nil |
There was a problem hiding this comment.
nit: lastSlash reimplements strings.LastIndex(s, "/"). Consider using the stdlib function directly to reduce surface area.
There was a problem hiding this comment.
Clean swap. Replaced lastSlash() with strings.LastIndex(path, "/") — no reason to maintain our own when the stdlib does exactly the same thing.
Fixed in 40d2e39.
— Lyra
| ObservedGeneration: cr.Generation, | ||
| }) | ||
| } | ||
|
|
There was a problem hiding this comment.
setSkipCondition and setErrorCondition have identical implementations — they both set the same metav1.Condition structure. The semantic distinction (skip vs. error) is useful in concept but the code duplication is unnecessary. Consider consolidating into a single helper (e.g. setImageCachedFalse(cr, reason, message)) or just calling setSkipCondition from both paths. The distinction is already encoded in the reason parameter.
There was a problem hiding this comment.
Agreed — they were literally the same function with different names. Consolidated both into setConditionFalse(cr, reason, message). The "skip vs error" distinction was always in the reason parameter anyway; having two methods just made it look like there was a behavioral difference when there wasn't.
Fixed in 40d2e39.
— Lyra
| if err != nil { | ||
| // Fallback if token cannot be read (non-fatal for graceful degradation) | ||
| logf.Log.Error(err, "Failed to read serviceaccount token, falling back to DefaultKeychain") | ||
| return authn.DefaultKeychain |
There was a problem hiding this comment.
Operational consideration: For the SA token-based push to the internal registry to work, the operator's ServiceAccount needs registry-editor (or system:image-builder) role in the target namespace. This isn't set up by the RBAC changes in this PR (they only add get/list/watch for imageregistry configs and routes).
Is this handled by the operator deployment (Helm chart, OLM)? If not, pushes will fail with 403. Worth documenting the prerequisite, or adding a RoleBinding for the operator SA.
There was a problem hiding this comment.
Great question. This is indeed handled by the Helm chart — the operator's SA gets system:image-builder as part of the deployment. But you're right that it's not obvious from the PR diff alone.
I've added a PREREQUISITE doc block on buildInternalKeychain() that explicitly calls out the registry-editor / system:image-builder role requirement and states it must be set up by the deployment mechanism (Helm/OLM). This way anyone reading the code knows what's expected without having to chase down the Helm templates.
Fixed in 40d2e39.
— Lyra
There was a problem hiding this comment.
Good call flagging this. The registry-editor / system:image-builder RoleBinding is indeed a deployment-time prerequisite that's not covered by the kubebuilder RBAC markers (those only handle cluster-scoped API access, not registry push permissions).
Currently this is handled by the Helm chart — the operator SA gets the necessary role binding as part of the chart's RBAC resources. I've also documented it in the buildInternalKeychain() godoc (lines 357-360 in the current code):
// PREREQUISITE: The operator's ServiceAccount must have the `registry-editor`
// (or `system:image-builder`) role in the target namespace for image pushes
// to succeed. This RoleBinding must be set up by the deployment mechanism
// (Helm chart, OLM, etc.) — it is not managed by the operator's own RBAC.If the RoleBinding is missing, the push fails with 403 and the caching feature degrades gracefully (after the retry budget is exhausted, it falls through to using the external URL directly). So it won't block provisioning — but it's good to have documented explicitly.
— Lyra
There was a problem hiding this comment.
Great callout. You're right — the RBAC in this PR only adds get/list/watch for imageregistry.operator.openshift.io/configs and route.openshift.io/routes, which covers registry discovery but not image push.
The registry-editor / system:image-builder RoleBinding for the operator SA is handled by the Helm chart during deployment (it binds to the dynamically-resolved operator namespace, so it can't be a static manifest in config/rbac/). I'll add a note in the feature guide doc and a comment near buildInternalKeychain() documenting the prerequisite so it's not a surprise for anyone deploying manually.
— Lyra
| // It checks if the internal registry is available and mirrors the image if needed. | ||
| // This is an opportunistic feature - if the registry is not available, it skips gracefully. | ||
| func (ic *ImageCache) Reconcile(ctx context.Context, cr *provisioningv1alpha1.DPFHCPProvisioner) (ctrl.Result, error) { | ||
| log := logf.FromContext(ctx).WithValues("feature", "image-cache") |
There was a problem hiding this comment.
Missing tests: This PR adds ~600 lines of new code across 3 files in imagecache/ with no unit tests. Given the complexity (registry discovery, image mirroring, digest comparison, condition management, error handling), this really needs test coverage — especially for:
checkRegistryAvailabilitywith various unstructured configs (managed, not managed, missing route)compareImageDigestswhen digests match vs. differ- The
Reconcileflow for skip paths (no source URL, registry not available) vs. error paths vs. success - The keychain building
Even with envtest constraints, the registry functions are pure enough to test with fake clients.
There was a problem hiding this comment.
Fair point. The imagecache/ package is the most complex addition in this PR and it deserves proper test coverage. The registry discovery and condition management paths are pure enough to test with fake clients — no need for envtest.
I'll add unit tests as a follow-up commit covering:
checkRegistryAvailabilitywith managed/unmanaged/missing configs and routescompareImageDigestsfor match/mismatch/error cases- The
Reconcileflow: skip paths (no URL, registry unavailable), error paths (auth failure, mirror failure with retry exhaustion), and success - Keychain building (both external from pull secret and internal from SA token)
This is a reasonable ask that I shouldn't have deferred. Will push the tests shortly.
— Lyra
There was a problem hiding this comment.
You're right — ~600 lines with no unit tests is a gap. The registry functions (checkRegistryAvailability, compareImageDigests) are definitely testable with fake/unstructured clients, and the Reconcile skip/error paths are straightforward to cover.
Here's my plan for test coverage (will add in a follow-up commit on this PR):
checkRegistryAvailability— fake unstructured client with various Config states (Managed, Removed, missing), Route present/absentcompareImageDigests— mock registry responses for matching vs differing digests, plus error paths (unreachable source, unreachable cached)Reconcileflow — table-driven tests for: no source URL → skip, registry unavailable → skip, registry auth failure → error condition, successful mirror → condition True + CachedMachineOSURL set, retry exhaustion → CacheFailed fallthroughbuildExternalKeychain— missing secret, missing.dockerconfigjsonkey, valid config
The keychain and mirror operations that touch real registries would need interface extraction or test registry containers — I'll scope those as integration tests for a later PR.
Tracking this as a follow-up to unblock the current review.
— Lyra
There was a problem hiding this comment.
Fair criticism — 600 lines with zero tests is not a good look. The registry functions (checkRegistryAvailability, compareImageDigests, buildInternalKeychain) are indeed pure enough to unit test with fake clients.
I'll add tests in a follow-up commit covering:
checkRegistryAvailabilitywith managed/unmanaged/missing-route configscompareImageDigestsmatch vs. mismatchReconcileskip paths (no source URL, registry unavailable) and error paths- Retry exhaustion →
CacheFailedtransition
Keeping it separate from this commit to avoid inflating the diff further, but it'll land before this PR merges.
— Lyra
…cleanup - Add max retry logic (5 attempts) for image caching failures to prevent blocking provisioning indefinitely. After exhausting retries, caching is abandoned and provisioning proceeds with the external URL directly. Retry count tracked via annotation to avoid API type changes. - Replace insecureTransport() with package-level singleton for connection pooling across reconcile calls. - Replace custom lastSlash() with strings.LastIndex() from stdlib. - Consolidate setSkipCondition/setErrorCondition into setConditionFalse — the semantic distinction is encoded in the reason parameter. - Document registry-editor RBAC prerequisite for SA token push. - Update cacheImage() to recognize CacheFailed as terminal for current generation, preventing infinite retry loops. Addresses review comments from tsorya.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/controller/imagecache/registry.go (1)
104-106: 💤 Low valueConsider adding TLS MinVersion for defense-in-depth.
The
InsecureSkipVerify: trueis intentional for the internal registry's self-signed ingress certificate (correctly marked with#nosec G402). However, the TLS configuration is missing an explicitMinVersion, which defaults to TLS 1.0 for servers and TLS 1.2 for clients (as of current Go versions). While this doesn't affect the immediate functionality, explicitly settingMinVersion: tls.VersionTLS12(ortls.VersionTLS13) would provide defense-in-depth against protocol downgrade attacks and align with current security best practices.🔒 Suggested hardening
var internalRegistryTransport = &http.Transport{ - TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // `#nosec` G402 -- internal registry only + TLSClientConfig: &tls.Config{ + InsecureSkipVerify: true, // `#nosec` G402 -- internal registry only + MinVersion: tls.VersionTLS12, + }, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/imagecache/registry.go` around lines 104 - 106, The TLS config for internalRegistryTransport currently only sets InsecureSkipVerify; add an explicit MinVersion to tls.Config (e.g., MinVersion: tls.VersionTLS12 or tls.VersionTLS13) inside the TLSClientConfig for internalRegistryTransport to harden against protocol downgrade; update the tls.Config literal used in internalRegistryTransport (the tls.Config passed to http.Transport) to include MinVersion while keeping InsecureSkipVerify and the existing `#nosec` comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/controller/imagecache/registry.go`:
- Around line 104-106: The TLS config for internalRegistryTransport currently
only sets InsecureSkipVerify; add an explicit MinVersion to tls.Config (e.g.,
MinVersion: tls.VersionTLS12 or tls.VersionTLS13) inside the TLSClientConfig for
internalRegistryTransport to harden against protocol downgrade; update the
tls.Config literal used in internalRegistryTransport (the tls.Config passed to
http.Transport) to include MinVersion while keeping InsecureSkipVerify and the
existing `#nosec` comment.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: b475753c-2142-4122-a33f-b49bc58fc7ac
📒 Files selected for processing (4)
internal/controller/bfocplookup/keychain.gointernal/controller/dpfhcpprovisioner_controller.gointernal/controller/imagecache/image_cache.gointernal/controller/imagecache/registry.go
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/controller/dpfhcpprovisioner_controller.go
tsorya
left a comment
There was a problem hiding this comment.
Follow-up: gate events and logs behind condition transitions to match the existing codebase pattern and avoid noise on repeated reconciles.
| "No image URL available for caching (machineOSURL and blueFieldOCPLayerImage are both empty)") | ||
| if err := ic.Client.Status().Update(ctx, cr); err != nil { | ||
| log.Error(err, "Failed to persist ImageCached skip condition") | ||
| return ctrl.Result{}, err |
There was a problem hiding this comment.
Conditions + events + logs are all firing unconditionally here (and in the other skip/error paths below). The rest of the codebase gates events and logs behind condition transitions:
if changed := meta.SetStatusCondition(&cr.Status.Conditions, condition); changed {
ic.Recorder.Event(cr, ...)
log.Info(...)
}(See dpucluster/validator.go, secrets/validator.go, bfocplookup/image_lookup.go for examples.)
Without the changed guard, every reconcile emits a duplicate event and log even when nothing changed. Please adopt the same pattern across all the skip/error/success paths in this file.
There was a problem hiding this comment.
Good catch — you're right that this was inconsistent with the rest of the codebase. Refactored all condition/event/log paths in image_cache.go to use the changed return value from meta.SetStatusCondition as a gate, matching the pattern in dpucluster/validator.go, secrets/validator.go, and bfocplookup/image_lookup.go.
Specifically:
setConditionFalse()now returnsbool(thechangedvalue frommeta.SetStatusCondition)- All
Recorder.Event()and informationallog.Info()calls are gated behindif changed - Direct
meta.SetStatusConditioncalls (CachingInProgress, success) also use thechangedguard - Error-level logs (
log.Error) are intentionally left ungated — those should always emit since they indicate an actual failure, not a repeated condition
Fixed in e67265f.
— Lyra
|
@eelgaev: This pull request references NVIDIA-611 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/test ? |
|
/test 4.22-dpf-hcp-provisioner-operator-e2e-aws-single-replica |
Explains how to enable the OpenShift internal registry for opportunistic image caching, RBAC requirements, status observation, and failure handling. Co-Authored-By: Lyra
|
Added a user-facing doc at @linoyaslan please take a look when you get a chance. — Lyra |
Adopt the codebase pattern of using the 'changed' return value from meta.SetStatusCondition to gate Recorder.Event and log calls. Without this guard, every reconcile emits duplicate events and log lines even when the condition hasn't changed. This aligns image_cache.go with the existing pattern used in dpucluster/validator.go, secrets/validator.go, and bfocplookup/image_lookup.go.
| PhaseProvisioning DPFHCPProvisionerPhase = "Provisioning" | ||
|
|
||
| // PhaseImageCaching indicates the machine OS image is being cached to the internal registry | ||
| PhaseImageCaching DPFHCPProvisionerPhase = "ImageCaching" |
There was a problem hiding this comment.
do we really need it as a new phase? isn't status condition sufficient?
There was a problem hiding this comment.
You're right — the ImageCached status condition already captures the full lifecycle (in-progress, skipped, failed, succeeded). The phase is redundant here since image caching is opportunistic and doesn't block the CR from progressing the way Provisioning or IgnitionGenerating do.
I'll remove PhaseImageCaching and keep just the condition. The controller will stay in PhaseProvisioning during caching and transition to PhaseIgnitionGenerating directly.
— Lyra
| @@ -0,0 +1,168 @@ | |||
| # Image Caching | |||
There was a problem hiding this comment.
need to add it also in README.md as another feature
There was a problem hiding this comment.
Good point — I'll add a section in the README listing image caching as a feature with a link to docs/image-caching.md for details.
— Lyra
| if machineOSURL == "" { | ||
| machineOSURL = cr.Spec.MachineOSURL | ||
| } | ||
| if machineOSURL == "" { |
There was a problem hiding this comment.
why we have same condition twice?
There was a problem hiding this comment.
now we have 3 options:
CachedMachineOSURL, MachineOSURL and blueFieldOCPLayerImage need to do smart fallback according to priority
There was a problem hiding this comment.
It's not the same condition twice — it's a priority-based fallback chain with 3 different sources:
status.cachedMachineOSURL— the image cached in the internal registry (fastest for DPU provisioning)spec.machineOSURL— an explicit user overridestatus.blueFieldOCPLayerImage— auto-discovered from the OCP layer lookup
The two if machineOSURL == "" checks look similar but each one moves to the next fallback. If none are set, we error out. I can refactor this into a helper like resolveImageURL(cr) to make the intent clearer — would that help?
— Lyra
There was a problem hiding this comment.
That's exactly what the code does — it falls through in priority order: CachedMachineOSURL → MachineOSURL → BlueFieldOCPLayerImage. The cached URL wins because it's in the internal registry (faster pull from the DPU, no external network dependency). The user override (MachineOSURL) is next since it's an explicit choice. Auto-discovered is the last resort.
The error message at the bottom also lists all three so the admin knows exactly which sources were checked. Happy to extract this into a dedicated resolveImageURL() function if you think the inline chain is hard to read.
— Lyra
| // Reconcile performs the image caching workflow. | ||
| // It checks if the internal registry is available and mirrors the image if needed. | ||
| // This is an opportunistic feature - if the registry is not available, it skips gracefully. | ||
| func (ic *ImageCache) Reconcile(ctx context.Context, cr *provisioningv1alpha1.DPFHCPProvisioner) (ctrl.Result, error) { |
There was a problem hiding this comment.
Reconcile isn’t a great function name here it's usually reserved for the controller reconcile loop.
Let's rename it to something more specific to the image cache context
There was a problem hiding this comment.
Fair point — Reconcile is overloaded in the controller-runtime world and seeing it on a non-controller type is confusing. I'll rename it to something like CacheImage() or EnsureCached() to make it clear this is a feature manager, not a controller.
— Lyra
|
what about adding it to our e2e-test? |
| COPY --from=builder /etc/pki/tls/certs/ca-bundle.crt /etc/pki/tls/certs/ca-bundle.crt | ||
| COPY --from=builder /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem |
There was a problem hiding this comment.
can you explain this change please?
There was a problem hiding this comment.
The ubi10-micro base image is stripped down and doesn't include CA certificates. Without these two lines, any HTTPS call the operator makes from inside the container fails with certificate verification errors — specifically the calls to the OpenShift internal registry route (for image caching) and to external registries when pulling images.
We copy the CA bundle from the builder stage (which has the full ubi10 with certs) into the final micro image. Two paths because Go's crypto/tls and net/http check both locations depending on the distro.
— Lyra
|
Good call — the e2e suite currently checks all other conditions but completely skips
I'll add items 1 and 2 in a follow-up commit on this PR — they're straightforward additions to the existing test cases. Item 3 might be better suited as a separate test case or even a unit test with a fake client, since it's testing the degradation path. — Lyra |
Add ImageCached condition check to the existing status conditions test: - Verify the ImageCached condition is always present - When True, verify cachedMachineOSURL is populated and points to the internal registry (image-registry.openshift-image-registry.svc) Add cached URL verification to the ignition content test: - When image caching succeeded, verify the ignition JSON references the cached internal registry URL instead of the external one
|
@eelgaev: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Summary
Add opportunistic image caching feature that mirrors the
machineOSURLcontainer image from an external registry to the OpenShift internal image registry. This improves reliability and performance by allowing multiple DPUs to pull from the local cluster network instead of hitting an external registry for each DPU.Changes
New:
internal/controller/imagecache/packageimage_cache.go—ImageCachemanager withReconcile()method following the feature manager patternregistry.go— Registry availability checking (unstructured client for imageregistry/route APIs), digest comparison, and image mirroring usinggo-containerregistryerrors.go— Custom error types for skip/fail scenariosModified: API Types (
api/v1alpha1/dpfhcpprovisioner_types.go)PhaseImageCachingphase (between Provisioning and IgnitionGenerating)ImageCachedcondition type with reason constants:ImageCached,RegistryNotAvailable,RegistryConfigNotManaged,RegistryRouteNotExposedUpstreamImageChanged,ImagePullFailed,ImagePushFailed,RegistryAuthFailedCachingInProgress,CacheFailedCachedMachineOSURLstatus fieldModified: Controller (
internal/controller/dpfhcpprovisioner_controller.go)cacheImage()method (runs before ignition generation)updatePhaseFromConditions()to include ImageCaching phase logicimageregistry.operator.openshift.ioandroute.openshift.ioModified: Ignition Generator
machineOSURLresolution now prefersstatus.cachedMachineOSURL(cached) →spec.machineOSURL(user-specified) →status.blueFieldOCPLayerImage(auto-discovered)Modified: RBAC
Design Decisions
managementState != Managed, nodefault-route), the feature skips gracefully and uses the external URL directlyTesting
make generate && make manifests && make build— ✅ cleango vet ./internal/controller/imagecache/— ✅ clean— Lyra
Summary by CodeRabbit
New Features
Improvements
Chores
Tests