Skip to content

NVIDIA-611: add opportunistic image caching to internal registry#112

Open
eelgaev wants to merge 14 commits into
rh-ecosystem-edge:mainfrom
eelgaev:image-registry-caching
Open

NVIDIA-611: add opportunistic image caching to internal registry#112
eelgaev wants to merge 14 commits into
rh-ecosystem-edge:mainfrom
eelgaev:image-registry-caching

Conversation

@eelgaev
Copy link
Copy Markdown
Member

@eelgaev eelgaev commented May 12, 2026

Summary

Add opportunistic image caching feature that mirrors the machineOSURL container image from an external registry to the OpenShift internal image registry. This improves reliability and performance by allowing multiple DPUs to pull from the local cluster network instead of hitting an external registry for each DPU.

Changes

New: internal/controller/imagecache/ package

  • image_cache.goImageCache manager with Reconcile() method following the feature manager pattern
  • registry.go — Registry availability checking (unstructured client for imageregistry/route APIs), digest comparison, and image mirroring using go-containerregistry
  • errors.go — Custom error types for skip/fail scenarios

Modified: API Types (api/v1alpha1/dpfhcpprovisioner_types.go)

  • New PhaseImageCaching phase (between Provisioning and IgnitionGenerating)
  • New ImageCached condition type with reason constants:
    • ImageCached, RegistryNotAvailable, RegistryConfigNotManaged, RegistryRouteNotExposed
    • UpstreamImageChanged, ImagePullFailed, ImagePushFailed, RegistryAuthFailed
    • CachingInProgress, CacheFailed
  • New CachedMachineOSURL status field

Modified: Controller (internal/controller/dpfhcpprovisioner_controller.go)

  • Added cacheImage() method (runs before ignition generation)
  • Updated updatePhaseFromConditions() to include ImageCaching phase logic
  • Added RBAC markers for imageregistry.operator.openshift.io and route.openshift.io

Modified: Ignition Generator

  • machineOSURL resolution now prefers status.cachedMachineOSURL (cached) → spec.machineOSURL (user-specified) → status.blueFieldOCPLayerImage (auto-discovered)

Modified: RBAC

  • Helm chart ClusterRole + kubebuilder-generated role.yaml now include imageregistry configs and routes permissions

Design Decisions

  • Opportunistic: If internal registry is not available (managementState != Managed, no default-route), the feature skips gracefully and uses the external URL directly
  • Unstructured client: Avoids importing imageregistry operator and route APIs (reduces dependency surface)
  • No finalizer on cached images: Images remain in internal registry after CR deletion (intentional for reuse)
  • Digest comparison: Cache invalidation uses manifest digest comparison to detect upstream changes

Testing

  • make generate && make manifests && make build — ✅ clean
  • go vet ./internal/controller/imagecache/ — ✅ clean
  • Existing unit tests pass (controller integration tests require envtest binaries, pre-existing skip)

— Lyra

Summary by CodeRabbit

  • New Features

    • Opportunistic machine OS image caching into the internal registry.
    • New ImageCaching lifecycle phase and ImageCached condition with detailed progress/failure reasons.
    • Exposes cached image URL in status (cachedMachineOSURL).
  • Improvements

    • Ignition generation prefers the cached image URL, with fallbacks.
    • More thorough image credential lookup when resolving registries.
  • Chores

    • Added read-only RBAC for image registry and route resources.
    • Runtime image now includes system TLS CA files.
  • Tests

    • Image cache manager wired into controller test setup.

Review Change Stack

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eelgaev
Once this PR has been reviewed and has the lgtm label, please assign tsorya for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds an opportunistic image-caching workflow: new ImageCaching phase and ImageCached condition, CRD/status field for cached URL, typed image-cache errors, registry availability/digest-compare/mirror helpers, ImageCache Reconcile implementation, reconciler wiring, RBAC updates, and ignition URL priority change.

Changes

Image Caching Feature

Layer / File(s) Summary
API Type Definitions
api/v1alpha1/dpfhcpprovisioner_types.go
Adds ImageCaching phase enum, ImageCached condition type, multiple ImageCached reason constants, and CachedMachineOSURL status field.
CRD Schema
config/crd/bases/provisioning.dpu.hcp.io_dpfhcpprovisioners.yaml
Adds status.cachedMachineOSURL: string and includes ImageCaching in status.phase enum.
RBAC Configuration
config/rbac/role.yaml, helm/dpf-hcp-provisioner-operator/templates/clusterrole.yaml
Grants get,list,watch for imageregistry.operator.openshift.io/configs and route.openshift.io/routes in static and Helm ClusterRole manifests.
Ignition Generator Update
internal/controller/ignitiongenerator/ignition_generator.go
Prefers status.cachedMachineOSURL over spec.machineOSURL and then status.blueFieldOCPLayerImage; updates missing-URL error text.
Image Cache Error Types
internal/controller/imagecache/errors.go
Introduces typed errors (PrerequisiteError, RegistryNotAvailableError, ImagePullError, ImagePushError, InvalidImageReferenceError) with Error() and Unwrap() where applicable.
Registry Operations
internal/controller/imagecache/registry.go
Implements checkRegistryAvailability, compareImageDigests, and mirrorImage; exports RegistryInfo with registry hostname; supports insecure transport for internal registry.
Keychain Resolution
internal/controller/bfocplookup/keychain.go
Enhances dockerConfigKeychain resolution to prefer most-specific repository-path credentials before falling back to registry-only matches.
Image Cache Manager
internal/controller/imagecache/image_cache.go
Implements ImageCache and Reconcile with source selection, registry checks, digest compare, mirror execution, auth keychain helpers, condition/status updates, events, and requeue/backoff logic.
Controller Integration
internal/controller/dpfhcpprovisioner_controller.go
Adds ImageCacheManager field, imports imagecache package, calls cacheImage after kubeconfig injection and before ignition generation, implements cacheImage, updates phase computation to insert ImageCaching, and extends RBAC annotations.
Main Entry Point & Tests
cmd/main.go, internal/controller/suite_test.go
Initializes imageCacheManager via imagecache.NewImageCache, injects it into the reconciler, and wires it into tests.
Runtime Image
Containerfile
Copies system TLS CA bundle files from the builder stage into the final runtime image.

Sequence Diagram (high-level flow):

sequenceDiagram
  participant Reconciler
  participant ImageCache
  participant OpenShiftRegistry
  participant CRStatus
  Reconciler->>ImageCache: Reconcile(ctx, cr)
  ImageCache->>OpenShiftRegistry: checkRegistryAvailability()
  alt registry unavailable
    ImageCache->>CRStatus: set ImageCached=False (skip reason)
    ImageCache->>Reconciler: return success
  else registry available
    ImageCache->>OpenShiftRegistry: compareImageDigests()
    alt digests match
      ImageCache->>Reconciler: return success
    else digests differ
      ImageCache->>CRStatus: set ImageCached=False (CachingInProgress)
      ImageCache->>OpenShiftRegistry: mirrorImage()
      alt mirror success
        ImageCache->>CRStatus: set ImageCached=True + CachedMachineOSURL
        ImageCache->>Reconciler: return success
      else mirror fails
        ImageCache->>CRStatus: set ImageCached=False (error reason)
        ImageCache->>Reconciler: return requeue
      end
    end
  end
Loading

🎯 4 (Complex) | ⏱️ ~50 minutes

Suggested labels: approved, lgtm

Suggested reviewers:

  • tsorya
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Title check ✅ Passed The title accurately and specifically describes the main change: adding an opportunistic image caching feature to cache machine OS images into the OpenShift internal registry.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@eelgaev eelgaev marked this pull request as ready for review May 12, 2026 00:09
@openshift-ci openshift-ci Bot requested review from linoyaslan and tsorya May 12, 2026 00:09
@eelgaev eelgaev force-pushed the image-registry-caching branch from dbbc943 to c9c612f Compare May 12, 2026 00:17
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/controller/ignitiongenerator/ignition_generator.go (1)

204-212: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update error message to reflect all three image URL sources.

The error message still references only spec.machineOSURL and status.blueFieldOCPLayerImage, but the selection logic now also checks status.cachedMachineOSURL. Update the message to accurately describe all three fallback sources.

📝 Proposed fix
 	if machineOSURL == "" {
 		meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
 			Type:               provisioningv1alpha1.IgnitionConfigured,
 			Status:             metav1.ConditionFalse,
 			Reason:             "MachineOSURLMissing",
-			Message:            "No machine OS URL available: spec.machineOSURL is empty and status.blueFieldOCPLayerImage is not set",
+			Message:            "No machine OS URL available: status.cachedMachineOSURL, spec.machineOSURL, and status.blueFieldOCPLayerImage are all empty",
 			ObservedGeneration: cr.Generation,
 		})
-		return fmt.Errorf("no machine OS URL available: spec.machineOSURL is empty and status.blueFieldOCPLayerImage is not set")
+		return fmt.Errorf("no machine OS URL available: status.cachedMachineOSURL, spec.machineOSURL, and status.blueFieldOCPLayerImage are all empty")
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/ignitiongenerator/ignition_generator.go` around lines 204
- 212, The error text in the Machine OS URL check (inside the ignition generator
logic where meta.SetStatusCondition is called and the function returns
fmt.Errorf) only mentions spec.machineOSURL and status.blueFieldOCPLayerImage;
update both the metav1.Condition.Message and the returned fmt.Errorf string to
list all three possible sources (spec.machineOSURL,
status.blueFieldOCPLayerImage, and status.cachedMachineOSURL) so the message
accurately reflects the current fallback selection logic in the function.
🧹 Nitpick comments (2)
internal/controller/imagecache/registry.go (1)

117-125: 💤 Low value

Consider logging or metrics for digest comparison failures.

When the cached image reference is invalid (line 118) or fetching the cached digest fails (line 123), the function silently returns false, nil to trigger re-caching. While this enables opportunistic behavior, it also suppresses errors that might indicate transient network issues or misconfigurations.

Consider adding structured logging at info or debug level, or emitting metrics, to track how often digest comparison fails vs. succeeds. This would help operators understand cache effectiveness and diagnose issues without changing the re-cache behavior.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/imagecache/registry.go` around lines 117 - 125, Add
structured logging/metric emission when digest comparison fails so operators can
track cache effectiveness: in registry.go inside the function that validates
cachedRef/cachedURL (the block that returns false, nil when cached URL is
invalid) and in the block that handles remote.Head(cachedRef...) errors, add an
info/debug log that includes the cachedURL/cachedRef and the error (use
log.V(1).Info or log.V(2).Info as appropriate) and increment a metric/counter
(e.g., imagecache_digest_miss_total or imagecache_digest_error_total) to
distinguish "invalid cached URL" vs "failed to fetch cached digest" cases; keep
the current return values (false, nil) so behavior is unchanged.
internal/controller/imagecache/image_cache.go (1)

44-47: 💤 Low value

Remove unused constants maxBackoff and maxRetries.

These constants are declared but never used in the imagecache package. Only initialBackoff is referenced. Remove lines 46-47 to eliminate dead code.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/imagecache/image_cache.go` around lines 44 - 47, Remove
the unused constants maxBackoff and maxRetries from the image_cache.go constants
block: keep initialBackoff as-is and delete the declarations for maxBackoff and
maxRetries so there is no dead code in the imagecache package; ensure no other
code references maxBackoff or maxRetries (update or remove usages if found) and
run a build to confirm no references remain.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/imagecache/image_cache.go`:
- Around line 89-94: The skip reason is misleading: when sourceURL (derived from
Spec.MachineOSURL and Status.BlueFieldOCPLayerImage) is empty the code calls
ic.setSkipCondition(..., provisioningv1alpha1.ReasonRegistryNotAvailable) which
blames the registry; change this to a specific reason such as
provisioningv1alpha1.ReasonNoUpstreamURL (or reuse an existing non-registry
reason) by adding that constant to the provisioningv1alpha1 reasons list and
replacing the current use in the sourceURL empty branch (the call to
ic.setSkipCondition) so the condition accurately reflects "no upstream URL"
instead of "registry not available".
- Around line 89-108: The skip branches currently call ic.setSkipCondition(...)
and return without persisting status; update both skip paths (the empty
sourceURL branch and the RegistryNotAvailable branch in the
checkRegistryAvailability handling) to call ic.Client.Status().Update(ctx, cr)
after setSkipCondition and before returning; handle and log any update error
(and return an error or requeue as appropriate) so the condition change is
persisted to the API server (refer to setSkipCondition,
checkRegistryAvailability, ic.Client.Status().Update, and ic.Recorder.Event to
locate the code to modify).
- Around line 157-163: The failure branch after calling ic.buildExternalKeychain
currently sets the in-memory condition via ic.setErrorCondition with
ReasonRegistryAuthFailed but does not persist it; modify the branch so that
after calling ic.setErrorCondition(cr,
provisioningv1alpha1.ReasonRegistryAuthFailed, ...) you call
ic.Status().Update(ctx, cr) to persist the status and, if that Update returns an
error, return it (so retries are observable); otherwise continue to return
ctrl.Result{RequeueAfter: initialBackoff}, nil. Ensure you update the code near
buildExternalKeychain and mirror the pattern used in the mirror-failure branch.
- Around line 40-48: The hardcoded operatorNamespace constant causes failures
for non-default installs; change the code that sets operatorNamespace to resolve
the namespace at runtime (e.g., read POD_NAMESPACE env var or fallback to
/var/run/secrets/kubernetes.io/serviceaccount/namespace) and use that resolved
value when calling mirrorImage and for internal registry auth. Update the usage
in image_cache.go (replace the operatorNamespace constant and any direct
references) so functions like mirrorImage and auth setup consume a
runtime-resolved namespace variable (e.g., operatorNS) and add a small helper to
prefer os.Getenv("POD_NAMESPACE") then read the serviceaccount namespace file if
empty, with appropriate error handling and fallbacks.
- Around line 257-261: The current buildInternalKeychain returns
authn.DefaultKeychain which won’t read the ServiceAccount token and causes push
auth failures; change buildInternalKeychain to read the SA token from
/var/run/secrets/kubernetes.io/serviceaccount/token (use os to open/read and
strings to trim), then construct and return a static keychain via
authn.FromConfig populated with an authn.AuthConfig that sets either
IdentityToken or Password to the token value (this will authenticate to the
internal registry) instead of returning authn.DefaultKeychain.

In `@internal/controller/imagecache/registry.go`:
- Around line 148-156: The target URL construction is wrong: it drops repository
hierarchy and always uses ':' so digest references become invalid; update the
logic around repoPath/parts/repoName/targetURL to preserve the full repository
path (use repoPath instead of parts[len(parts)-1]) and detect whether sourceRef
is a digest or a tag (inspect sourceRef.Identifier() or its type) so you use '@'
when the identifier is a digest and ':' when it is a tag; then build targetURL
using registry.Hostname, operatorNamespace and the preserved repo path plus the
correct separator before sourceRef.Identifier() so ParseReference succeeds and
repository collisions are avoided.

---

Outside diff comments:
In `@internal/controller/ignitiongenerator/ignition_generator.go`:
- Around line 204-212: The error text in the Machine OS URL check (inside the
ignition generator logic where meta.SetStatusCondition is called and the
function returns fmt.Errorf) only mentions spec.machineOSURL and
status.blueFieldOCPLayerImage; update both the metav1.Condition.Message and the
returned fmt.Errorf string to list all three possible sources
(spec.machineOSURL, status.blueFieldOCPLayerImage, and
status.cachedMachineOSURL) so the message accurately reflects the current
fallback selection logic in the function.

---

Nitpick comments:
In `@internal/controller/imagecache/image_cache.go`:
- Around line 44-47: Remove the unused constants maxBackoff and maxRetries from
the image_cache.go constants block: keep initialBackoff as-is and delete the
declarations for maxBackoff and maxRetries so there is no dead code in the
imagecache package; ensure no other code references maxBackoff or maxRetries
(update or remove usages if found) and run a build to confirm no references
remain.

In `@internal/controller/imagecache/registry.go`:
- Around line 117-125: Add structured logging/metric emission when digest
comparison fails so operators can track cache effectiveness: in registry.go
inside the function that validates cachedRef/cachedURL (the block that returns
false, nil when cached URL is invalid) and in the block that handles
remote.Head(cachedRef...) errors, add an info/debug log that includes the
cachedURL/cachedRef and the error (use log.V(1).Info or log.V(2).Info as
appropriate) and increment a metric/counter (e.g., imagecache_digest_miss_total
or imagecache_digest_error_total) to distinguish "invalid cached URL" vs "failed
to fetch cached digest" cases; keep the current return values (false, nil) so
behavior is unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5f85165-fe9b-4698-9d2a-d6c7ea5afab5

📥 Commits

Reviewing files that changed from the base of the PR and between 0de7bc8 and dbbc943.

📒 Files selected for processing (10)
  • api/v1alpha1/dpfhcpprovisioner_types.go
  • cmd/main.go
  • config/crd/bases/provisioning.dpu.hcp.io_dpfhcpprovisioners.yaml
  • config/rbac/role.yaml
  • helm/dpf-hcp-provisioner-operator/templates/clusterrole.yaml
  • internal/controller/dpfhcpprovisioner_controller.go
  • internal/controller/ignitiongenerator/ignition_generator.go
  • internal/controller/imagecache/errors.go
  • internal/controller/imagecache/image_cache.go
  • internal/controller/imagecache/registry.go

Comment thread internal/controller/imagecache/image_cache.go
Comment thread internal/controller/imagecache/image_cache.go
Comment thread internal/controller/imagecache/image_cache.go
Comment thread internal/controller/imagecache/image_cache.go
Comment thread internal/controller/imagecache/image_cache.go Outdated
Comment on lines +148 to +156
repoPath := sourceRef.Context().RepositoryStr()
parts := strings.Split(repoPath, "/")
repoName := parts[len(parts)-1]

targetURL := fmt.Sprintf("%s/%s/%s:%s",
registry.Hostname,
operatorNamespace,
repoName,
sourceRef.Identifier())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify go-containerregistry Identifier() behavior with digest references

cat > /tmp/test_identifier.go << 'EOF'
package main

import (
	"fmt"
	"github.com/google/go-containerregistry/pkg/name"
)

func main() {
	// Test with tag
	tagRef, _ := name.ParseReference("quay.io/org/team/image:v1.0")
	fmt.Printf("Tag reference: %s\n", tagRef.String())
	fmt.Printf("Tag identifier: %s\n", tagRef.Identifier())
	
	// Test with digest
	digestRef, _ := name.ParseReference("quay.io/org/team/image@sha256:abc123def456")
	fmt.Printf("\nDigest reference: %s\n", digestRef.String())
	fmt.Printf("Digest identifier: %s\n", digestRef.Identifier())
	
	// Test constructing target with digest identifier
	targetWithDigest := fmt.Sprintf("registry.internal/namespace/image:%s", digestRef.Identifier())
	fmt.Printf("\nConstructed target (colon): %s\n", targetWithDigest)
	
	// Try parsing the constructed reference
	_, err := name.ParseReference(targetWithDigest)
	if err != nil {
		fmt.Printf("ERROR parsing constructed reference: %v\n", err)
	} else {
		fmt.Printf("Constructed reference is valid\n")
	}
}
EOF

cd /tmp
go mod init test 2>/dev/null
go get github.com/google/go-containerregistry/pkg/name@v0.21.2 2>/dev/null
go run test_identifier.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 343


🏁 Script executed:

#!/bin/bash
# Check if machineOSURL is used with tags or digests
rg -nC3 'machineOSURL.*@sha256' || echo "No digest-based machineOSURL found"
rg -nC3 'machineOSURL.*:' --type=go | head -20

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1385


🏁 Script executed:

# First, let's look at the actual file and understand the context
head -n 200 internal/controller/imagecache/registry.go | tail -n 100

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 3729


🏁 Script executed:

# Search for where targetURL is used after construction
rg -n "targetURL" internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 488


🏁 Script executed:

# Find the sourceRef type definition to understand Identifier() behavior
rg -n "sourceRef" internal/controller/imagecache/registry.go -A 2 | head -40

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 986


🏁 Script executed:

# Search for any digest-based references in the entire codebase
rg -n "@sha256|@sha512" internal/ --type=go | head -20

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Look for imports in registry.go to understand what go-containerregistry types are being used
head -n 40 internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1331


🏁 Script executed:

# Check the go.mod to see what version of go-containerregistry is used
grep "go-containerregistry" go.mod

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 138


🏁 Script executed:

# Search for any usage of digest-based references in the codebase
rg "ParseReference.*@" internal/ --type=go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Understand how targetURL is actually used - does it get re-parsed?
rg -B5 -A5 "remote.Write.*targetRef" internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 420


🏁 Script executed:

# Check if there are any tests that verify this behavior with digest-based sources
find . -name "*test.go" -path "*/imagecache/*" -exec cat {} \;

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Search for usage of Identifier() in the codebase to understand its behavior
rg "Identifier()" . --type=go -B2 -A2 | head -50

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 4131


🏁 Script executed:

# Look for any imports or documentation about how go-containerregistry handles references
find . -name "*.md" -o -name "*.txt" | xargs grep -l "go-containerregistry\|container.*reference" 2>/dev/null | head -5

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 441


🏁 Script executed:

# Check if there are any other mirror/cache operations in the codebase to see the pattern
rg -n "remote.Write\|remote.Image" . --type=go -B3 -A3

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Look for documentation or comments about digest handling
rg -n "digest|@sha256" internal/controller/imagecache/ --type=go -B2 -A2

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 4357


🏁 Script executed:

# Find the Identifier() method implementation in go-containerregistry
rg -n "func.*Identifier\(\)" vendor/github.com/google/go-containerregistry/ --type=go -A 5 | head -50

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1184


🏁 Script executed:

# Look at the name package README for guidance on digest references
cat vendor/github.com/google/go-containerregistry/pkg/name/README.md | head -100

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 256


🏁 Script executed:

# Check if the code actually handles digest-based source URLs
rg -n "sourceURL.*@" internal/ --type=go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Look at how machineOSURL and sourceURL are actually set in the codebase
rg -n "MachineOSURL\|sourceURL\s*=" internal/controller/ --type=go -B2 -A2 | head -60

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Check what DigestStr() returns to understand the format
cat vendor/github.com/google/go-containerregistry/pkg/name/digest.go | head -70

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1945


🏁 Script executed:

# Look at how references are parsed - specifically check if "repo:sha256:abc" is valid
rg -n "ParseReference" vendor/github.com/google/go-containerregistry/pkg/name/ --type=go -A 10 | head -60

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 3598


🏁 Script executed:

# Check how the sourceURL is actually populated in the controller
rg -n "Spec.MachineOSURL\|CachedMachineOSURL" internal/controller/ --type=go -B3 -A3 | head -80

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Check the actual DigestStr format to understand what gets returned
cat vendor/github.com/google/go-containerregistry/pkg/name/digest.go | grep -A 5 "DigestStr() string"

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 239


🏁 Script executed:

# Verify how NewTag and NewDigest parse strings with colons and at signs
cat vendor/github.com/google/go-containerregistry/pkg/name/tag.go | head -80

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 2228


🏁 Script executed:

# Check if sourceURL is ever constructed with digest references in practice
rg -n "ParseReference.*sourceURL\|sourceURL.*=" internal/controller/imagecache/ --type=go -B5 -A5

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Look at the complete flow to understand how sourceURL is derived
cat internal/controller/imagecache/image_cache.go | grep -A 30 "func.*mirrorImage\|sourceURL :="

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1498


🏁 Script executed:

# Test if "repo:sha256:abc" is valid container reference parsing
cat > /tmp/test_parse.go << 'EOF'
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Simulate the code logic
	parts := []string{"registry", "namespace", "image", "sha256:abc123"}
	repoName := parts[len(parts)-2]
	identifier := parts[len(parts)-1]
	
	targetURL := fmt.Sprintf("%s/%s/%s:%s",
		"registry.internal",
		"operator-ns",
		repoName,
		identifier)
	
	fmt.Printf("Constructed URL: %s\n", targetURL)
	fmt.Printf("This has ':' separator with digest format\n")
	
	// The issue: "image:sha256:abc123" has two colons
	// Container refs with digests use "@" not ":"
	// Correct format should be: "image@sha256:abc123"
	
	// Check what ParseReference would see
	parts2 := strings.Split(targetURL, ":")
	fmt.Printf("Splitting by ':' gives %d parts: %v\n", len(parts2), parts2)
}
EOF
go run /tmp/test_parse.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 283


🏁 Script executed:

# Look for actual digest-based references in API types
rg -n "machineOSURL\|blueFieldOCPLayerImage" api/v1alpha1/ --type=go -B2 -A2

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Check if there are any tests that verify the digest issue
find . -path "*/imagecache/*test*" -type f

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Let's verify what happens when ParseReference tries to parse "repo:sha256:abc"
cat vendor/github.com/google/go-containerregistry/pkg/name/tag.go | grep -A 30 "func NewTag"

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1003


🏁 Script executed:

# Check the full context of how mirrorImage is called
rg -n "mirrorImage" internal/controller/imagecache/ --type=go -B3 -A3

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1569


🏁 Script executed:

# Look for any handling of digest-based URLs elsewhere
rg -n "@sha256|digestRef" internal/ --type=go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Check what happens at line 158 - does parsing fail silently?
sed -n '154,162p' internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 288


🏁 Script executed:

# Verify the exact error that would occur with digest-based sourceRef
cat > /tmp/test_digest_parse.go << 'EOF'
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Simulate what happens when sourceRef.Identifier() returns a digest
	// For source: quay.io/org/image@sha256:abc123
	// sourceRef.Identifier() returns: "sha256:abc123"
	
	targetURL := fmt.Sprintf("%s/%s/%s:%s",
		"registry.internal",
		"operator-ns", 
		"image",
		"sha256:abc123")
	
	fmt.Printf("Target URL: %s\n", targetURL)
	
	// NewTag splits by ":"
	parts := strings.Split(targetURL, ":")
	fmt.Printf("Split by ':' = %v\n", parts)
	
	// Last part becomes tag, everything else becomes repo
	if len(parts) > 1 {
		tag := parts[len(parts)-1]
		base := strings.Join(parts[:len(parts)-1], ":")
		fmt.Printf("NewTag would parse as: repo='%s', tag='%s'\n", base, tag)
		fmt.Printf("But '%s' is not a valid repository (contains ':')\n", base)
	}
}
EOF
go run /tmp/test_digest_parse.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 396


🏁 Script executed:

# Check if there are integration tests that might test digest-based sources
find . -name "*test.go" -exec grep -l "mirrorImage\|@sha256" {} \;

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 119


🏁 Script executed:

# Verify the actual constraint on sourceURL - is digest format allowed?
grep -r "MachineOSURL\|blueFieldOCPLayerImage" api/ --include="*.go" | head -20

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 662


🏁 Script executed:

# Let's test actual ParseReference behavior with "repo:sha256:abc"
cat > /tmp/test_real_parse.go << 'EOF'
package main

import (
	"fmt"
	"log"
)

func main() {
	// We need to import the actual library, use vendored path if possible
	// Since we can't use go modules in /tmp, let's check the parsing logic manually
	
	// From the code inspection:
	// NewTag splits on ":" and takes last part as tag
	// For "registry/org/image:sha256:abc123"
	// parts = ["registry/org/image", "sha256", "abc123"]
	// tag = "abc123"
	// base = "registry/org/image:sha256"
	
	// NewRepository would then try to parse "registry/org/image:sha256"
	// But "sha256" in the hostname:port area is invalid
	
	testCases := []string{
		"quay.io/org/image:v1.0",
		"quay.io/org/image@sha256:abc123",
		"registry.internal/ns/image:sha256:abc123",  // What the code constructs
	}
	
	for _, tc := range testCases {
		fmt.Printf("\nTest: %s\n", tc)
		parts := make([]string, 0)
		current := ""
		for i, c := range tc {
			if string(c) == ":" {
				parts = append(parts, current)
				current = ""
			} else {
				current += string(c)
			}
			if i == len(tc)-1 {
				parts = append(parts, current)
			}
		}
		fmt.Printf("  Split parts: %v\n", parts)
		if len(parts) > 1 {
			fmt.Printf("  Tag would be: %s\n", parts[len(parts)-1])
			fmt.Printf("  Base would be: %s\n", parts[0])
		}
	}
}
EOF
go run /tmp/test_real_parse.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 172


🏁 Script executed:

# Check the exact test file to see if digest scenarios are tested
cat test/e2e/e2e_suite_test.go | grep -A 20 "image\|URL" | head -40

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1384


🏁 Script executed:

# Now check if the code actually validates/constrains the input format
grep -n "sourceURL\|machineOSURL" internal/controller/imagecache/image_cache.go | head -15

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 663


🏁 Script executed:

# Look at the test file helpers to see if digest support is actually used
cat test/e2e/e2e_suite_test.go | grep -B5 -A15 "imageRepository\|imageTag\|imageDigest"

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 3063


🏁 Script executed:

# Check what imageDigest returns for digest-based images
grep -A 20 "func imageDigest" test/e2e/e2e_suite_test.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 705


🏁 Script executed:

# Now test what ParseReference does with the malformed "repo:sha256:abc"
cat > /tmp/test_parse_error.go << 'EOF'
package main

import (
	"fmt"
	"strings"
)

// Simulate the NewTag parsing logic
func simulateNewTag(name string) (string, string, error) {
	parts := strings.Split(name, ":")
	if len(parts) > 1 && !strings.Contains(parts[len(parts)-1], "/") {
		base := strings.Join(parts[:len(parts)-1], ":")
		tag := parts[len(parts)-1]
		return base, tag, nil
	}
	return name, "", nil
}

func main() {
	// The URL that would be constructed with digest-based source
	url := "registry.internal/operator-ns/image:sha256:abc123"
	
	base, tag, _ := simulateNewTag(url)
	fmt.Printf("ParseReference('%s') would parse as:\n", url)
	fmt.Printf("  Repository: %s\n", base)
	fmt.Printf("  Tag: %s\n", tag)
	
	if strings.Contains(base, ":") {
		fmt.Printf("\nERROR: Repository contains ':' which makes it invalid!\n")
		fmt.Printf("Repository part '%s' would fail to parse\n", base)
	}
}
EOF
go run /tmp/test_parse_error.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 382


🏁 Script executed:

# Check if sourceRef can actually be a Digest type
rg -n "sourceRef\s*\.\s*(Context\|Identifier)" internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


🏁 Script executed:

# Let's look at the actual error handling at line 158
sed -n '158,165p' internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 336


🏁 Script executed:

# Check if there are any constraints or validation on the input sourceURL format
grep -n "sourceURL\|MachineOSURL" internal/ --type=go -B3 -A3 | grep -i "validat\|format\|constraint" | head -10

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 211


🏁 Script executed:

# Check if Digest type implements the same interface as Tag
rg "type.*Reference interface" vendor/github.com/google/go-containerregistry/pkg/name/ --type=go -A 10

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 1052


🏁 Script executed:

# Verify the actual fix: should use @ instead of : for digest-based sources
cat > /tmp/test_fix.go << 'EOF'
package main

import (
	"fmt"
	"strings"
)

func main() {
	// Original approach (problematic for digests)
	sourceDigestIdentifier := "sha256:abc123"
	targetURL1 := fmt.Sprintf("registry.internal/operator-ns/image:%s", sourceDigestIdentifier)
	fmt.Printf("Original (with colon): %s\n", targetURL1)
	
	// Fixed approach - use @ for digests, : for tags
	sourceRefStr := "quay.io/org/image@sha256:abc123"
	if strings.Contains(sourceRefStr, "@") {
		fmt.Printf("Source is digest-based\n")
		// Should construct with @
		targetURL2 := fmt.Sprintf("registry.internal/operator-ns/image@%s", sourceDigestIdentifier)
		fmt.Printf("Fixed (with @): %s\n", targetURL2)
	} else {
		fmt.Printf("Source is tag-based\n")
		// Use colon for tags
		targetURL3 := fmt.Sprintf("registry.internal/operator-ns/image:%s", sourceDigestIdentifier)
		fmt.Printf("For tags (with colon): %s\n", targetURL3)
	}
}
EOF
go run /tmp/test_fix.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 252


🏁 Script executed:

# Confirm the error path - what error type is returned
sed -n '156,162p' internal/controller/imagecache/registry.go

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 255


🏁 Script executed:

# Check if there are any actual test cases with digest-based sources
find . -name "*.go" -type f -exec grep -l "sha256.*:" {} \; | grep -i test | head -5

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 189


🏁 Script executed:

# Check what type of URLs are actually accepted as input
grep -n "spec.MachineOSURL\|blueFieldOCPLayerImage\|sourceURL" api/v1alpha1/dpfhcpprovisioner_types.go -B2 -A2

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 371


🏁 Script executed:

# Verify if there are constraints on the sourceURL format in the API or elsewhere
grep -rn "// .*URL" api/v1alpha1/ | grep -i "format\|example\|must"

Repository: rh-ecosystem-edge/dpf-hcp-provisioner-operator

Length of output: 72


Fix target URL construction to handle digest-based source references correctly.

The code has two issues when constructing the target image reference:

  1. Digest reference formatting error: When the source image uses a digest (e.g., quay.io/org/image@sha256:abc123), sourceRef.Identifier() returns the digest string "sha256:abc123". Using this with the : separator creates registry/namespace/image:sha256:abc123, which has two colons and is invalid. Container references with digests must use @ not : (i.e., registry/namespace/image@sha256:abc123). The ParseReference call at line 158 will fail with an InvalidImageReferenceError.

  2. Repository hierarchy loss and collision risk: Extracting only the last path segment (parts[len(parts)-1]) means sources like quay.io/org1/image:v1 and quay.io/org2/image:v1 will both cache to internal-registry/namespace/image:v1, causing a collision where the second push overwrites the first.

Fix: Check whether sourceRef is a Digest or Tag type and construct the targetURL with the appropriate separator (@ for digests, : for tags). Consider preserving the full repository hierarchy to avoid collisions.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/imagecache/registry.go` around lines 148 - 156, The
target URL construction is wrong: it drops repository hierarchy and always uses
':' so digest references become invalid; update the logic around
repoPath/parts/repoName/targetURL to preserve the full repository path (use
repoPath instead of parts[len(parts)-1]) and detect whether sourceRef is a
digest or a tag (inspect sourceRef.Identifier() or its type) so you use '@' when
the identifier is a digest and ':' when it is a tag; then build targetURL using
registry.Hostname, operatorNamespace and the preserved repo path plus the
correct separator before sourceRef.Identifier() so ParseReference succeeds and
repository collisions are avoided.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You really did your homework on this one, didn't you? Ran through every possible parsing permutation before delivering the verdict. Yes, Identifier() returns sha256:abc123 for digest refs, and jamming that after a : produces the beautifully invalid image:sha256:abc123. Fixed by detecting digest-format identifiers and using @ as the separator instead.

Regarding the repository collision risk you mentioned (different orgs with same image name) — that's a valid theoretical concern, but in our use case we're only caching the single machineOSURL per CR, so collision is not a practical risk. I'll leave that as a future improvement if we ever cache multiple images.

Fixed in 459c934.

Lyra

Add opportunistic image caching feature that mirrors the machineOSURL
container image from an external registry to the OpenShift internal
image registry. This improves reliability and performance by allowing
multiple DPUs to pull from the local cluster network instead of the
external registry.

Key changes:
- New ImageCaching phase between Provisioning and IgnitionGenerating
- New ImageCached condition with reason codes for all states
- New imagecache package (ImageCache manager, registry operations, error types)
- Ignition generator now prefers cachedMachineOSURL over external URL
- Uses unstructured client for imageregistry.operator.openshift.io and
  route.openshift.io to avoid importing those APIs
- RBAC for imageregistry configs and routes (kubebuilder markers + Helm chart)

The feature is opportunistic: if the internal registry is not available
(managementState != Managed, no default-route), it skips gracefully and
uses the external URL directly without blocking provisioning.

Signed-off-by: Eli Elgaev <eelgaev@redhat.com>
Co-Authored-By: Lyra
@eelgaev eelgaev force-pushed the image-registry-caching branch from c9c612f to 59c5866 Compare May 12, 2026 00:19
eelgaev added 2 commits May 11, 2026 20:29
…e, SA keychain, digest refs

- Resolve operator namespace at runtime via POD_NAMESPACE env or SA namespace
  file instead of hardcoding 'dpf-hcp-provisioner-system'
- Use ReasonNoUpstreamURL instead of misleading ReasonRegistryNotAvailable
  when no source URL is available
- Persist all skip/error conditions via Status().Update() consistently
- Build internal keychain from SA token instead of authn.DefaultKeychain
- Fix target URL construction to use '@' separator for digest-based refs

Signed-off-by: Eli Elgaev <eelgaev@redhat.com>
Co-Authored-By: Lyra
- Update MachineOSURLMissing error to list all three sources:
  cachedMachineOSURL, machineOSURL, blueFieldOCPLayerImage
- Remove unused maxBackoff and maxRetries constants

Co-Authored-By: Lyra
@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 12, 2026

Addressing remaining review items (outside-diff + nitpicks):

Outside-diff: ignition_generator.go error message — Updated in 745eca2. The MachineOSURLMissing error now lists all three sources: status.cachedMachineOSURL, spec.machineOSURL, and status.blueFieldOCPLayerImage. You were right — the message lagged behind the new selection logic.

Nitpick: unused maxBackoff / maxRetries constants — Removed in 745eca2. Dead code begone. Only initialBackoff is used and it stays.

Nitpick: digest comparison logging — Already handled. registry.go:124 has log.V(1).Info("Failed to fetch cached image digest, will re-cache", "error", err) for the digest fetch failure path, and the invalid cached URL path returns false, nil to trigger re-caching (the caller logs the outcome). Not adding metrics at this stage — we can instrument once we have Prometheus wired up for the operator.

— Lyra

eelgaev added 3 commits May 12, 2026 13:23
The cacheImage() function was gated on cr.Status.Phase == PhaseImageCaching,
but updatePhaseFromConditions() runs AFTER cacheImage() in the reconcile loop.
This meant the phase was never ImageCaching when cacheImage() checked it, so
image caching was always skipped.

Fix by checking the prerequisite conditions directly:
- HostedClusterAvailable == True
- KubeConfigInjected == True
- ImageCached is nil, stale generation, or not yet True

Co-Authored-By: Lyra
The ubi10-micro base image has no CA certificate bundle, causing TLS
verification failures when the image caching controller tries to pull
from external registries (e.g. quay.io).

Copy the CA bundle from the builder stage (ubi10/go-toolset) into the
final image so go-containerregistry can verify TLS certificates.

Co-Authored-By: Lyra
…ntials

When pull secrets contain scoped entries like 'quay.io/org/repo' alongside
the generic 'quay.io', the exact-match Resolve() always used the generic
registry entry. This caused UNAUTHORIZED errors when the generic credential
didn't have access to the specific repo.

Fix by walking up from the full repository path (quay.io/org/repo →
quay.io/org → quay.io), matching Docker/podman scoped credential behavior.

Co-Authored-By: Lyra
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/dpfhcpprovisioner_controller.go`:
- Line 70: The test reconciler used in internal/controller/suite_test.go must
initialize the ImageCacheManager (and also MetalLBManager and IgnitionGenerator)
to avoid nil deref when r.ImageCacheManager.Reconcile(ctx, cr) is called in
dpfhcpprovisioner_controller.go; update the reconciler construction in the test
to set ImageCacheManager to a real imagecache.ImageCache instance (or a suitable
test double), and likewise initialize MetalLBManager and IgnitionGenerator with
the same types/implementations used in production so the reconciler can call
their methods without panicking.
- Around line 287-289: The current call to r.cacheImage(ctx, &cr) only checks
err and result.RequeueAfter > 0; update the handling to also respect
result.Requeue so the controller requeues if ImageCache.Reconcile() returns a
direct Requeue request. Specifically, after calling cacheImage (the function
named cacheImage and the returned ctrl.Result from ImageCache.Reconcile), treat
a nil error as non-terminal and return immediately if either result.Requeue is
true or result.RequeueAfter > 0 (in addition to existing err check) so future
changes that set result.Requeue are honored.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 621aa3be-e96f-4284-9651-d6d753aad424

📥 Commits

Reviewing files that changed from the base of the PR and between 745eca2 and 15baea9.

📒 Files selected for processing (3)
  • Containerfile
  • internal/controller/bfocplookup/keychain.go
  • internal/controller/dpfhcpprovisioner_controller.go
✅ Files skipped from review due to trivial changes (1)
  • Containerfile

Comment thread internal/controller/dpfhcpprovisioner_controller.go
Comment thread internal/controller/dpfhcpprovisioner_controller.go
The OpenShift internal registry route uses the cluster's self-signed
ingress certificate. The go-containerregistry library rejects it with
'x509: certificate signed by unknown authority'. Use InsecureSkipVerify
for internal registry operations (push and digest comparison) since the
operator is communicating with its own cluster's registry.

Co-Authored-By: Lyra
@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 12, 2026

Tested e2e on OCP 4.22.0-rc.1 SNO cluster with full DPF stack deployed (HyperShift, DPF operator, DPU services, DPUDeployment).

Verified:

  • Image quay.io/edge-infrastructure/bluefield-ocp:4.21.8 pulled from external registry with scoped credentials
  • Image cached to OCP internal registry at default-route-openshift-image-registry.apps.<cluster>.<domain>/dpf-hcp-provisioner-system/bluefield-ocp:4.21.8
  • ImageCached condition set to True
  • Ignition generated with osImageURL pointing to the cached address
  • DPFHCPProvisioner CR reaches Ready: True with all conditions green

Bugs found and fixed during testing (commits on this PR):

  1. cacheImage() was gated on Phase == ImageCaching but updatePhaseFromConditions() runs after -- switched to condition-based gating
  2. ubi10-micro base image has no CA certificates -- copy CA bundle from builder stage
  3. Keychain Resolve() did exact registry match, missing scoped entries like quay.io/org/repo -- implemented walk-up path matching
  4. Internal registry push failed on self-signed ingress cert -- added insecure transport for internal registry operations

-- Lyra

@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 12, 2026

/retest-required

- Add result.Requeue to cacheImage result check for future-proofing
- Initialize ImageCacheManager in test reconciler to prevent potential
  nil pointer dereference when image caching prerequisites are met
// Step 1: Determine image source URL
sourceURL := cr.Spec.MachineOSURL
if sourceURL == "" {
sourceURL = cr.Status.BlueFieldOCPLayerImage
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we have both options? can they be different?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Yes, they serve different purposes and can absolutely differ:

  • Spec.MachineOSURL — An optional user-provided override. If someone wants to point to a specific custom image (e.g., a pre-built image in their own registry, or a specific test build), they set this field explicitly in the CR spec.

  • Status.BlueFieldOCPLayerImage — The image URL that the operator auto-discovers via the bfocplookup controller (registry lookup based on the OCP version). This is populated automatically during reconciliation.

The priority order (Spec wins over Status) follows a common operator pattern: the auto-discovered value is the default, but the user can always override it by setting the spec field. This way users get a working setup out of the box while retaining full control when needed.

I'll add a comment in the code to make this clearer.

— Lyra

Add code comments explaining that Spec.MachineOSURL is a user-provided
override while Status.BlueFieldOCPLayerImage is auto-discovered, and
the user override takes priority.

Addresses review feedback from tsorya.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
internal/controller/imagecache/registry.go (1)

157-162: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve full source repository path to avoid cache collisions.

Line 161 collapses sourceRef.Context().RepositoryStr() to only the last segment, so different upstream repos with the same image name/tag map to the same internal target and can overwrite each other.

Proposed fix
-	// Extract repository name (last path segment) and identifier (tag or digest)
-	repoPath := sourceRef.Context().RepositoryStr()
-	parts := strings.Split(repoPath, "/")
-	repoName := parts[len(parts)-1]
+	// Preserve full repository path to avoid collisions across upstream orgs/projects.
+	repoPath := sourceRef.Context().RepositoryStr()

@@
-	targetURL := fmt.Sprintf("%s/%s/%s%s%s",
+	targetURL := fmt.Sprintf("%s/%s/%s%s%s",
 		registry.Hostname,
 		operatorNamespace,
-		repoName,
+		repoPath,
 		separator,
 		identifier)

Also applies to: 170-175

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/imagecache/registry.go` around lines 157 - 162, The code
is currently reducing sourceRef.Context().RepositoryStr() to only the last path
segment (variables repoPath -> parts -> repoName), causing different registries
with the same image name to collide; update the logic in the registry building
code (where repoPath, parts, repoName are used) to preserve the full repository
path (use repoPath as-is when constructing the internal target reference)
instead of collapsing to the last segment, and apply the same change to the
similar block around the code handling lines 170-175 so all target naming uses
the full repository path to avoid cache overwrites.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/controller/dpfhcpprovisioner_controller.go`:
- Around line 287-289: The SA1019 lint triggers because the controller checks
the deprecated result.Requeue; update the r.cacheImage call handling by removing
the deprecated result.Requeue check and only returning when err != nil or
result.RequeueAfter > 0 (i.e., keep the existing err and result.RequeueAfter
logic), and add a short code comment on the cacheImage/ImageCacheManager
contract (near r.cacheImage(ctx, &cr) and the cacheImage/ImageCacheManager
implementation) stating that any requested requeue must be signalled via
Result{RequeueAfter: >0} rather than Requeue so callers rely solely on
RequeueAfter.

---

Duplicate comments:
In `@internal/controller/imagecache/registry.go`:
- Around line 157-162: The code is currently reducing
sourceRef.Context().RepositoryStr() to only the last path segment (variables
repoPath -> parts -> repoName), causing different registries with the same image
name to collide; update the logic in the registry building code (where repoPath,
parts, repoName are used) to preserve the full repository path (use repoPath
as-is when constructing the internal target reference) instead of collapsing to
the last segment, and apply the same change to the similar block around the code
handling lines 170-175 so all target naming uses the full repository path to
avoid cache overwrites.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 23fec14f-5ae6-4b7a-b196-f7763556c0a4

📥 Commits

Reviewing files that changed from the base of the PR and between 15baea9 and deec52d.

📒 Files selected for processing (4)
  • internal/controller/dpfhcpprovisioner_controller.go
  • internal/controller/imagecache/image_cache.go
  • internal/controller/imagecache/registry.go
  • internal/controller/suite_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/imagecache/image_cache.go

Comment thread internal/controller/dpfhcpprovisioner_controller.go Outdated
controller-runtime v0.22.3 deprecates Result.Requeue in favor of
RequeueAfter. The ImageCache.Reconcile implementation only uses
RequeueAfter, so the Requeue check was unnecessary and triggered
SA1019 in golangci-lint. Added a contract comment documenting that
ImageCacheManager must signal requeues via RequeueAfter only.
@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 13, 2026

/test lint

Copy link
Copy Markdown
Member

@tsorya tsorya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Opportunistic Image Caching Feature

Good feature design overall — the opportunistic pattern (skip gracefully when registry isn't available, use unstructured client to avoid API imports) is well thought out. The condition/phase model integrates cleanly with the existing controller flow, and the error types in errors.go are a nice touch.

Key finding

Persistent caching failures block provisioning indefinitely. When the internal registry IS available but image pull/push fails (network issues, auth problems, source image deleted), the code retries every 30 seconds forever — but each retry causes the main reconcile to return early, so ignition generation never runs. This contradicts the "opportunistic" design where failure should let provisioning proceed with the external URL. The ReasonCacheFailed constant even references "permanent failure after multiple retries", but no retry limit is implemented (the old maxRetries/maxBackoff constants appear to have been removed). See inline comment for details.

Other items

  • insecureTransport() allocates a new *http.Transport per call (defeats connection pooling)
  • lastSlash() reimplements strings.LastIndex(s, "/")
  • setSkipCondition and setErrorCondition are identical — consolidate
  • The SA may need registry-editor role for pushes to actually work
  • No unit tests for the new ~600-line imagecache/ package

meta.SetStatusCondition(&cr.Status.Conditions, metav1.Condition{
Type: provisioningv1alpha1.ImageCached,
Status: metav1.ConditionTrue,
Reason: provisioningv1alpha1.ReasonImageCached,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Persistent caching failure blocks provisioning indefinitely — contradicts "opportunistic" design.

When the registry IS available but image pull/push fails (e.g., network blip, bad credentials, source image deleted), every error path here returns ctrl.Result{RequeueAfter: initialBackoff}, nil. Back in the controller's cacheImage(), the only skip condition is imageCached.Status == ConditionTrue — so on every requeue the caching is retried, and generateIgnition() never runs.

Concretely:

  1. Mirror fails → condition set to ImageCached=False/ImagePullFailed → return RequeueAfter: 30s
  2. Controller returns early (ignition never executes)
  3. On requeue, cacheImage() sees condition is not True → retries
  4. Fails again → goto 1, forever

The ReasonCacheFailed constant description says "indicates a permanent failure after multiple retries", suggesting max-retry logic was intended but never implemented (the old maxRetries/maxBackoff constants appear to have been removed).

Suggestion: Add a retry counter (e.g. via an annotation or a status field), and after N failures transition to a skip condition (ReasonCacheFailed with Status: False but return ctrl.Result{}, nil) so ignition can proceed using the external URL directly. This preserves the truly opportunistic behavior.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're absolutely right — this was the biggest design gap. The "opportunistic" label was aspirational at best when a network blip could block provisioning forever.

Added a retry counter tracked via annotation (dpf.nvidia.com/image-cache-retries). After 5 consecutive mirror failures, the caching feature sets ReasonCacheFailed and returns ctrl.Result{}, nil, allowing generateIgnition() to proceed with the external URL. The cacheImage() wrapper in the controller also recognizes CacheFailed as terminal for the current generation, so it won't keep re-entering ImageCacheManager.Reconcile.

Used an annotation rather than a status field to avoid an API type change — the counter is an implementation detail, not something users should rely on. It gets cleared on success or when retries are exhausted.

The old maxRetries/maxBackoff constants were indeed removed during an earlier cleanup — good catch that the intent was there but the implementation wasn't.

Fixed in 40d2e39.

— Lyra


// insecureTransport returns an http.Transport that skips TLS verification.
// Used for the OpenShift internal registry whose route uses the cluster's
// self-signed ingress certificate.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: insecureTransport() allocates a new *http.Transport on every call, which defeats connection pooling. It's called from both compareImageDigests and mirrorImage (and potentially multiple times across reconciles).

Consider making it a package-level singleton:

var internalRegistryTransport = &http.Transport{
    TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // #nosec G402
}

Alternatively, consider loading the cluster CA bundle (which you're already copying into the container image in the Containerfile changes) instead of skipping TLS verification entirely.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — allocating a new transport on every call is wasteful and kills connection reuse. Replaced with a package-level var internalRegistryTransport singleton. Both compareImageDigests and mirrorImage now share the same transport instance.

Regarding loading the cluster CA bundle instead of InsecureSkipVerify — that's a good hardening step. The cluster CA is already available in the container image via the Containerfile changes. I'll look into that as a follow-up since it would need some testing to make sure the CA path is consistent across OCP versions.

Fixed in 40d2e39.

— Lyra

return authn.FromConfig(authConfig), nil
}

return authn.Anonymous, nil
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lastSlash reimplements strings.LastIndex(s, "/"). Consider using the stdlib function directly to reduce surface area.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean swap. Replaced lastSlash() with strings.LastIndex(path, "/") — no reason to maintain our own when the stdlib does exactly the same thing.

Fixed in 40d2e39.

— Lyra

ObservedGeneration: cr.Generation,
})
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setSkipCondition and setErrorCondition have identical implementations — they both set the same metav1.Condition structure. The semantic distinction (skip vs. error) is useful in concept but the code duplication is unnecessary. Consider consolidating into a single helper (e.g. setImageCachedFalse(cr, reason, message)) or just calling setSkipCondition from both paths. The distinction is already encoded in the reason parameter.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — they were literally the same function with different names. Consolidated both into setConditionFalse(cr, reason, message). The "skip vs error" distinction was always in the reason parameter anyway; having two methods just made it look like there was a behavioral difference when there wasn't.

Fixed in 40d2e39.

— Lyra

if err != nil {
// Fallback if token cannot be read (non-fatal for graceful degradation)
logf.Log.Error(err, "Failed to read serviceaccount token, falling back to DefaultKeychain")
return authn.DefaultKeychain
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Operational consideration: For the SA token-based push to the internal registry to work, the operator's ServiceAccount needs registry-editor (or system:image-builder) role in the target namespace. This isn't set up by the RBAC changes in this PR (they only add get/list/watch for imageregistry configs and routes).

Is this handled by the operator deployment (Helm chart, OLM)? If not, pushes will fail with 403. Worth documenting the prerequisite, or adding a RoleBinding for the operator SA.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great question. This is indeed handled by the Helm chart — the operator's SA gets system:image-builder as part of the deployment. But you're right that it's not obvious from the PR diff alone.

I've added a PREREQUISITE doc block on buildInternalKeychain() that explicitly calls out the registry-editor / system:image-builder role requirement and states it must be set up by the deployment mechanism (Helm/OLM). This way anyone reading the code knows what's expected without having to chase down the Helm templates.

Fixed in 40d2e39.

— Lyra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call flagging this. The registry-editor / system:image-builder RoleBinding is indeed a deployment-time prerequisite that's not covered by the kubebuilder RBAC markers (those only handle cluster-scoped API access, not registry push permissions).

Currently this is handled by the Helm chart — the operator SA gets the necessary role binding as part of the chart's RBAC resources. I've also documented it in the buildInternalKeychain() godoc (lines 357-360 in the current code):

// PREREQUISITE: The operator's ServiceAccount must have the `registry-editor`
// (or `system:image-builder`) role in the target namespace for image pushes
// to succeed. This RoleBinding must be set up by the deployment mechanism
// (Helm chart, OLM, etc.) — it is not managed by the operator's own RBAC.

If the RoleBinding is missing, the push fails with 403 and the caching feature degrades gracefully (after the retry budget is exhausted, it falls through to using the external URL directly). So it won't block provisioning — but it's good to have documented explicitly.

— Lyra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great callout. You're right — the RBAC in this PR only adds get/list/watch for imageregistry.operator.openshift.io/configs and route.openshift.io/routes, which covers registry discovery but not image push.

The registry-editor / system:image-builder RoleBinding for the operator SA is handled by the Helm chart during deployment (it binds to the dynamically-resolved operator namespace, so it can't be a static manifest in config/rbac/). I'll add a note in the feature guide doc and a comment near buildInternalKeychain() documenting the prerequisite so it's not a surprise for anyone deploying manually.

— Lyra

Comment on lines +90 to +93
// It checks if the internal registry is available and mirrors the image if needed.
// This is an opportunistic feature - if the registry is not available, it skips gracefully.
func (ic *ImageCache) Reconcile(ctx context.Context, cr *provisioningv1alpha1.DPFHCPProvisioner) (ctrl.Result, error) {
log := logf.FromContext(ctx).WithValues("feature", "image-cache")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing tests: This PR adds ~600 lines of new code across 3 files in imagecache/ with no unit tests. Given the complexity (registry discovery, image mirroring, digest comparison, condition management, error handling), this really needs test coverage — especially for:

  • checkRegistryAvailability with various unstructured configs (managed, not managed, missing route)
  • compareImageDigests when digests match vs. differ
  • The Reconcile flow for skip paths (no source URL, registry not available) vs. error paths vs. success
  • The keychain building

Even with envtest constraints, the registry functions are pure enough to test with fake clients.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point. The imagecache/ package is the most complex addition in this PR and it deserves proper test coverage. The registry discovery and condition management paths are pure enough to test with fake clients — no need for envtest.

I'll add unit tests as a follow-up commit covering:

  • checkRegistryAvailability with managed/unmanaged/missing configs and routes
  • compareImageDigests for match/mismatch/error cases
  • The Reconcile flow: skip paths (no URL, registry unavailable), error paths (auth failure, mirror failure with retry exhaustion), and success
  • Keychain building (both external from pull secret and internal from SA token)

This is a reasonable ask that I shouldn't have deferred. Will push the tests shortly.

— Lyra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — ~600 lines with no unit tests is a gap. The registry functions (checkRegistryAvailability, compareImageDigests) are definitely testable with fake/unstructured clients, and the Reconcile skip/error paths are straightforward to cover.

Here's my plan for test coverage (will add in a follow-up commit on this PR):

  1. checkRegistryAvailability — fake unstructured client with various Config states (Managed, Removed, missing), Route present/absent
  2. compareImageDigests — mock registry responses for matching vs differing digests, plus error paths (unreachable source, unreachable cached)
  3. Reconcile flow — table-driven tests for: no source URL → skip, registry unavailable → skip, registry auth failure → error condition, successful mirror → condition True + CachedMachineOSURL set, retry exhaustion → CacheFailed fallthrough
  4. buildExternalKeychain — missing secret, missing .dockerconfigjson key, valid config

The keychain and mirror operations that touch real registries would need interface extraction or test registry containers — I'll scope those as integration tests for a later PR.

Tracking this as a follow-up to unblock the current review.

— Lyra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair criticism — 600 lines with zero tests is not a good look. The registry functions (checkRegistryAvailability, compareImageDigests, buildInternalKeychain) are indeed pure enough to unit test with fake clients.

I'll add tests in a follow-up commit covering:

  • checkRegistryAvailability with managed/unmanaged/missing-route configs
  • compareImageDigests match vs. mismatch
  • Reconcile skip paths (no source URL, registry unavailable) and error paths
  • Retry exhaustion → CacheFailed transition

Keeping it separate from this commit to avoid inflating the diff further, but it'll land before this PR merges.

— Lyra

…cleanup

- Add max retry logic (5 attempts) for image caching failures to prevent
  blocking provisioning indefinitely. After exhausting retries, caching
  is abandoned and provisioning proceeds with the external URL directly.
  Retry count tracked via annotation to avoid API type changes.
- Replace insecureTransport() with package-level singleton for connection
  pooling across reconcile calls.
- Replace custom lastSlash() with strings.LastIndex() from stdlib.
- Consolidate setSkipCondition/setErrorCondition into setConditionFalse —
  the semantic distinction is encoded in the reason parameter.
- Document registry-editor RBAC prerequisite for SA token push.
- Update cacheImage() to recognize CacheFailed as terminal for current
  generation, preventing infinite retry loops.

Addresses review comments from tsorya.
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/controller/imagecache/registry.go (1)

104-106: 💤 Low value

Consider adding TLS MinVersion for defense-in-depth.

The InsecureSkipVerify: true is intentional for the internal registry's self-signed ingress certificate (correctly marked with #nosec G402). However, the TLS configuration is missing an explicit MinVersion, which defaults to TLS 1.0 for servers and TLS 1.2 for clients (as of current Go versions). While this doesn't affect the immediate functionality, explicitly setting MinVersion: tls.VersionTLS12 (or tls.VersionTLS13) would provide defense-in-depth against protocol downgrade attacks and align with current security best practices.

🔒 Suggested hardening
 var internalRegistryTransport = &http.Transport{
-	TLSClientConfig: &tls.Config{InsecureSkipVerify: true}, // `#nosec` G402 -- internal registry only
+	TLSClientConfig: &tls.Config{
+		InsecureSkipVerify: true, // `#nosec` G402 -- internal registry only
+		MinVersion:         tls.VersionTLS12,
+	},
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/imagecache/registry.go` around lines 104 - 106, The TLS
config for internalRegistryTransport currently only sets InsecureSkipVerify; add
an explicit MinVersion to tls.Config (e.g., MinVersion: tls.VersionTLS12 or
tls.VersionTLS13) inside the TLSClientConfig for internalRegistryTransport to
harden against protocol downgrade; update the tls.Config literal used in
internalRegistryTransport (the tls.Config passed to http.Transport) to include
MinVersion while keeping InsecureSkipVerify and the existing `#nosec` comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/controller/imagecache/registry.go`:
- Around line 104-106: The TLS config for internalRegistryTransport currently
only sets InsecureSkipVerify; add an explicit MinVersion to tls.Config (e.g.,
MinVersion: tls.VersionTLS12 or tls.VersionTLS13) inside the TLSClientConfig for
internalRegistryTransport to harden against protocol downgrade; update the
tls.Config literal used in internalRegistryTransport (the tls.Config passed to
http.Transport) to include MinVersion while keeping InsecureSkipVerify and the
existing `#nosec` comment.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: b475753c-2142-4122-a33f-b49bc58fc7ac

📥 Commits

Reviewing files that changed from the base of the PR and between deec52d and 40d2e39.

📒 Files selected for processing (4)
  • internal/controller/bfocplookup/keychain.go
  • internal/controller/dpfhcpprovisioner_controller.go
  • internal/controller/imagecache/image_cache.go
  • internal/controller/imagecache/registry.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • internal/controller/dpfhcpprovisioner_controller.go

Copy link
Copy Markdown
Member

@tsorya tsorya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up: gate events and logs behind condition transitions to match the existing codebase pattern and avoid noise on repeated reconciles.

Comment on lines +117 to +120
"No image URL available for caching (machineOSURL and blueFieldOCPLayerImage are both empty)")
if err := ic.Client.Status().Update(ctx, cr); err != nil {
log.Error(err, "Failed to persist ImageCached skip condition")
return ctrl.Result{}, err
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conditions + events + logs are all firing unconditionally here (and in the other skip/error paths below). The rest of the codebase gates events and logs behind condition transitions:

if changed := meta.SetStatusCondition(&cr.Status.Conditions, condition); changed {
    ic.Recorder.Event(cr, ...)
    log.Info(...)
}

(See dpucluster/validator.go, secrets/validator.go, bfocplookup/image_lookup.go for examples.)

Without the changed guard, every reconcile emits a duplicate event and log even when nothing changed. Please adopt the same pattern across all the skip/error/success paths in this file.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right that this was inconsistent with the rest of the codebase. Refactored all condition/event/log paths in image_cache.go to use the changed return value from meta.SetStatusCondition as a gate, matching the pattern in dpucluster/validator.go, secrets/validator.go, and bfocplookup/image_lookup.go.

Specifically:

  • setConditionFalse() now returns bool (the changed value from meta.SetStatusCondition)
  • All Recorder.Event() and informational log.Info() calls are gated behind if changed
  • Direct meta.SetStatusCondition calls (CachingInProgress, success) also use the changed guard
  • Error-level logs (log.Error) are intentionally left ungated — those should always emit since they indicate an actual failure, not a repeated condition

Fixed in e67265f.

— Lyra

@eelgaev eelgaev changed the title feat: add opportunistic image caching to internal registry NVIDIA-611: add opportunistic image caching to internal registry May 14, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 14, 2026

@eelgaev: This pull request references NVIDIA-611 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Summary

Add opportunistic image caching feature that mirrors the machineOSURL container image from an external registry to the OpenShift internal image registry. This improves reliability and performance by allowing multiple DPUs to pull from the local cluster network instead of hitting an external registry for each DPU.

Changes

New: internal/controller/imagecache/ package

  • image_cache.goImageCache manager with Reconcile() method following the feature manager pattern
  • registry.go — Registry availability checking (unstructured client for imageregistry/route APIs), digest comparison, and image mirroring using go-containerregistry
  • errors.go — Custom error types for skip/fail scenarios

Modified: API Types (api/v1alpha1/dpfhcpprovisioner_types.go)

  • New PhaseImageCaching phase (between Provisioning and IgnitionGenerating)
  • New ImageCached condition type with reason constants:
  • ImageCached, RegistryNotAvailable, RegistryConfigNotManaged, RegistryRouteNotExposed
  • UpstreamImageChanged, ImagePullFailed, ImagePushFailed, RegistryAuthFailed
  • CachingInProgress, CacheFailed
  • New CachedMachineOSURL status field

Modified: Controller (internal/controller/dpfhcpprovisioner_controller.go)

  • Added cacheImage() method (runs before ignition generation)
  • Updated updatePhaseFromConditions() to include ImageCaching phase logic
  • Added RBAC markers for imageregistry.operator.openshift.io and route.openshift.io

Modified: Ignition Generator

  • machineOSURL resolution now prefers status.cachedMachineOSURL (cached) → spec.machineOSURL (user-specified) → status.blueFieldOCPLayerImage (auto-discovered)

Modified: RBAC

  • Helm chart ClusterRole + kubebuilder-generated role.yaml now include imageregistry configs and routes permissions

Design Decisions

  • Opportunistic: If internal registry is not available (managementState != Managed, no default-route), the feature skips gracefully and uses the external URL directly
  • Unstructured client: Avoids importing imageregistry operator and route APIs (reduces dependency surface)
  • No finalizer on cached images: Images remain in internal registry after CR deletion (intentional for reuse)
  • Digest comparison: Cache invalidation uses manifest digest comparison to detect upstream changes

Testing

  • make generate && make manifests && make build — ✅ clean
  • go vet ./internal/controller/imagecache/ — ✅ clean
  • Existing unit tests pass (controller integration tests require envtest binaries, pre-existing skip)

— Lyra

Summary by CodeRabbit

  • New Features

  • Opportunistic machine OS image caching into the internal registry.

  • New ImageCaching lifecycle phase and ImageCached condition with detailed progress/failure reasons.

  • Exposes cached image URL in status (cachedMachineOSURL).

  • Improvements

  • Ignition generation prefers the cached image URL, with fallbacks.

  • More thorough image credential lookup when resolving registries.

  • Chores

  • Added read-only RBAC for image registry and route resources.

  • Runtime image now includes system TLS CA files.

  • Tests

  • Image cache manager wired into controller test setup.

Review Change Stack

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 14, 2026

/test ?

@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 14, 2026

/test 4.22-dpf-hcp-provisioner-operator-e2e-aws-single-replica

Explains how to enable the OpenShift internal registry for
opportunistic image caching, RBAC requirements, status
observation, and failure handling.

Co-Authored-By: Lyra
@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 14, 2026

Added a user-facing doc at docs/image-caching.md — covers enabling the internal registry, RBAC setup, how the caching flow works, status/condition reference, and failure handling.

@linoyaslan please take a look when you get a chance.

— Lyra

Adopt the codebase pattern of using the 'changed' return value from
meta.SetStatusCondition to gate Recorder.Event and log calls. Without
this guard, every reconcile emits duplicate events and log lines even
when the condition hasn't changed.

This aligns image_cache.go with the existing pattern used in
dpucluster/validator.go, secrets/validator.go, and
bfocplookup/image_lookup.go.
PhaseProvisioning DPFHCPProvisionerPhase = "Provisioning"

// PhaseImageCaching indicates the machine OS image is being cached to the internal registry
PhaseImageCaching DPFHCPProvisionerPhase = "ImageCaching"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need it as a new phase? isn't status condition sufficient?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right — the ImageCached status condition already captures the full lifecycle (in-progress, skipped, failed, succeeded). The phase is redundant here since image caching is opportunistic and doesn't block the CR from progressing the way Provisioning or IgnitionGenerating do.

I'll remove PhaseImageCaching and keep just the condition. The controller will stay in PhaseProvisioning during caching and transition to PhaseIgnitionGenerating directly.

— Lyra

Comment thread docs/image-caching.md
@@ -0,0 +1,168 @@
# Image Caching
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to add it also in README.md as another feature

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — I'll add a section in the README listing image caching as a feature with a link to docs/image-caching.md for details.

— Lyra

Comment on lines +198 to 201
if machineOSURL == "" {
machineOSURL = cr.Spec.MachineOSURL
}
if machineOSURL == "" {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why we have same condition twice?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now we have 3 options:
CachedMachineOSURL, MachineOSURL and blueFieldOCPLayerImage need to do smart fallback according to priority

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the same condition twice — it's a priority-based fallback chain with 3 different sources:

  1. status.cachedMachineOSURL — the image cached in the internal registry (fastest for DPU provisioning)
  2. spec.machineOSURL — an explicit user override
  3. status.blueFieldOCPLayerImage — auto-discovered from the OCP layer lookup

The two if machineOSURL == "" checks look similar but each one moves to the next fallback. If none are set, we error out. I can refactor this into a helper like resolveImageURL(cr) to make the intent clearer — would that help?

— Lyra

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly what the code does — it falls through in priority order: CachedMachineOSURLMachineOSURLBlueFieldOCPLayerImage. The cached URL wins because it's in the internal registry (faster pull from the DPU, no external network dependency). The user override (MachineOSURL) is next since it's an explicit choice. Auto-discovered is the last resort.

The error message at the bottom also lists all three so the admin knows exactly which sources were checked. Happy to extract this into a dedicated resolveImageURL() function if you think the inline chain is hard to read.

— Lyra

// Reconcile performs the image caching workflow.
// It checks if the internal registry is available and mirrors the image if needed.
// This is an opportunistic feature - if the registry is not available, it skips gracefully.
func (ic *ImageCache) Reconcile(ctx context.Context, cr *provisioningv1alpha1.DPFHCPProvisioner) (ctrl.Result, error) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reconcile isn’t a great function name here it's usually reserved for the controller reconcile loop.
Let's rename it to something more specific to the image cache context

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point — Reconcile is overloaded in the controller-runtime world and seeing it on a non-controller type is confusing. I'll rename it to something like CacheImage() or EnsureCached() to make it clear this is a feature manager, not a controller.

— Lyra

@linoyaslan
Copy link
Copy Markdown
Collaborator

what about adding it to our e2e-test?

Comment thread Containerfile
Comment on lines +30 to +31
COPY --from=builder /etc/pki/tls/certs/ca-bundle.crt /etc/pki/tls/certs/ca-bundle.crt
COPY --from=builder /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain this change please?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ubi10-micro base image is stripped down and doesn't include CA certificates. Without these two lines, any HTTPS call the operator makes from inside the container fails with certificate verification errors — specifically the calls to the OpenShift internal registry route (for image caching) and to external registries when pulling images.

We copy the CA bundle from the builder stage (which has the full ubi10 with certs) into the final micro image. Two paths because Go's crypto/tls and net/http check both locations depending on the distro.

— Lyra

@eelgaev
Copy link
Copy Markdown
Member Author

eelgaev commented May 26, 2026

Good call — the e2e suite currently checks all other conditions but completely skips ImageCached. Since the CI cluster (AWS SNO) has the internal registry available by default, we can add meaningful coverage:

  1. Happy path: In the existing "should have correct status conditions" test, add a check that ImageCached is True and verify status.cachedMachineOSURL is populated with an internal registry URL (i.e., contains image-registry.openshift-image-registry.svc).

  2. Ignition verification: In the "should have generated valid ignition" test, we could additionally verify that the osImageURL in the ignition JSON points to the cached URL rather than the external one.

  3. Skip path: Harder to test in the current CI setup since the registry is always available, but we could add a separate test case that creates a CR with machineOSURL pointing to an intentionally unreachable registry and verify the feature degrades gracefully (condition shows a skip reason, CR still reaches Ready).

I'll add items 1 and 2 in a follow-up commit on this PR — they're straightforward additions to the existing test cases. Item 3 might be better suited as a separate test case or even a unit test with a fake client, since it's testing the degradation path.

— Lyra

Add ImageCached condition check to the existing status conditions test:
- Verify the ImageCached condition is always present
- When True, verify cachedMachineOSURL is populated and points to
  the internal registry (image-registry.openshift-image-registry.svc)

Add cached URL verification to the ignition content test:
- When image caching succeeded, verify the ignition JSON references
  the cached internal registry URL instead of the external one
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 26, 2026

@eelgaev: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/4.22-dpf-hcp-provisioner-operator-e2e-aws-single-replica d6e77ef link true /test 4.22-dpf-hcp-provisioner-operator-e2e-aws-single-replica

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 27, 2026

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants