Skip to content

feat: implement E2E deployment and test execution in e2e-ci.sh#266

Open
yossiovadia wants to merge 5 commits into
opendatahub-io:mainfrom
yossiovadia:feat/e2e-ci-deployment
Open

feat: implement E2E deployment and test execution in e2e-ci.sh#266
yossiovadia wants to merge 5 commits into
opendatahub-io:mainfrom
yossiovadia:feat/e2e-ci-deployment

Conversation

@yossiovadia
Copy link
Copy Markdown
Contributor

@yossiovadia yossiovadia commented May 12, 2026

Summary

Implements the TODO in test/e2e/scripts/e2e-ci.sh — the primary E2E orchestrator called by the Konflux pipeline (odh-konflux-central/integration-tests/ai-gateway-payload-processing/pr-group-testing-pipeline.yaml).

What this does

Step Action
1 Preflight checks: cluster connectivity, Istio, Gateway API CRDs, simulator reachability
2 Install ExternalModel CRD (if missing)
3 Create Gateway for routing (idempotent)
4 Deploy payload-processing via Helm chart using PAYLOAD_PROCESSING_IMAGE
5 Create test namespace with Istio injection
6 Run E2E tests as a Kubernetes Pod using pre-built PAYLOAD_PROCESSING_E2E_IMAGE
7 Extract JUnit report and report pass/fail

The test pod uses an init container pattern: the init container runs the ginkgo test binary, writes JUnit XML to a shared volume, and the main container stays alive for result extraction.

How it integrates with Konflux

The Konflux group-test pipeline already calls ./test/e2e/scripts/e2e-ci.sh and sets PAYLOAD_PROCESSING_IMAGE and PAYLOAD_PROCESSING_E2E_IMAGE from the Konflux snapshot. No pipeline changes needed — this PR fills in the script that the pipeline already invokes.

Environment variables

Variable Default Description
E2E_SIMULATOR_ENDPOINT 3.147.232.199 Simulator IP
PAYLOAD_PROCESSING_IMAGE quay.io/.../odh-stable BBR image to deploy
PAYLOAD_PROCESSING_E2E_IMAGE quay.io/.../odh-stable Pre-built test image
E2E_GATEWAY_NAMESPACE istio-system Gateway namespace
E2E_GATEWAY_NAME e2e-gateway Gateway name
E2E_NS e2e-models Test namespace
E2E_LABEL_FILTER (all) Ginkgo label filter
E2E_CLEANUP true Clean up after tests

Test plan

  • Tested on Kind cluster (bbr-e2e): 10 passed, 0 failed, 1 skipped
  • JUnit report extracted successfully (9589 bytes, valid XML)
  • Cleanup removes all created resources (Helm release, namespace, gateway)
  • Idempotent — safe to re-run on existing deployment (Helm upgrade)

Risk analysis

  • Risk rating: 2
  • Why: Only modifies a test script, no production code changes. The script is called by the existing Konflux pipeline which currently hits the TODO and exits. Worst case on failure: test pipeline fails (which it effectively does today since the TODO is a no-op). The Helm deployment and test execution patterns are proven — reused from setup-kind.sh and entrypoint.sh.

Summary by CodeRabbit

  • Tests
    • Replaced the simple E2E helper with a full CI end-to-end orchestrator: adds comprehensive preflight checks, automated CRD/gateway provisioning, deployment install/upgrade and rollout wait, isolated test-runner pods with log collection, teardown/cleanup on exit, and export of JUnit results with proper exit-code propagation for reliable, repeatable E2E runs.

Fills in the TODO placeholder with full deployment and test orchestration:

1. Preflight checks (cluster, Istio, Gateway API, simulator connectivity)
2. Install ExternalModel CRD
3. Create Gateway for routing
4. Deploy payload-processing via Helm chart
5. Create test namespace with Istio injection
6. Run E2E tests as a Kubernetes Pod using pre-built test image
7. Collect JUnit report and report pass/fail

Designed for Konflux — the pipeline already calls this script with
PAYLOAD_PROCESSING_IMAGE and PAYLOAD_PROCESSING_E2E_IMAGE from the
Konflux snapshot. Also works on Kind for local validation.

Tested on Kind cluster: 10 passed, 0 failed, 1 skipped.
JUnit report extracted successfully (9.5KB).
@openshift-ci
Copy link
Copy Markdown

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yossiovadia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 12, 2026

📝 Walkthrough

Walkthrough

The previous minimal script was replaced by a full E2E CI orchestrator (test/e2e/scripts/e2e-ci.sh). It computes script/project paths, sets strict shell options, defines many E2E-related environment variables and defaults, prints a structured run header, and installs a cleanup trap that tears down namespaces, Helm releases, gateways, and test pods. The script implements preflight checks (kubectl connectivity, Gateway API CRDs, Istio presence, simulator /health), installs the ExternalModel CRD if missing, creates and waits for an Istio Gateway, deploys payload-processing via Helm (install vs upgrade) with image parsing for Helm --set values and disables Istio sidecar injection on that deployment, sets up a test namespace with Istio injection, creates a test runner ServiceAccount and ClusterRoleBinding, launches a pod whose init container runs the E2E binary and writes a JUnit XML plus an exit-code file, waits for and streams logs, copies the JUnit report from the pod to the host, and exits with the test exit code.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Security findings

CWE-78 (OS Command Injection via unvalidated image parsing)

  • Code: splitting PAYLOAD_PROCESSING_IMAGE with IFS and parameter expansion.
  • Risk: Malformed image strings may inject unexpected values into Helm --set arguments.
  • Action: Validate image against a strict regex (e.g., ^([a-z0-9.-]+(?::[0-9]+)?/)?[a-z0-9.-]+(/[a-z0-9.-]+)*:[A-Za-z0-9._-]+$) or canonicalize with a parser; fail fast on mismatch.

CWE-273 (Excessive Privileges from ClusterRoleBinding to cluster-admin)

  • Code: creating ClusterRoleBinding granting cluster-admin to test runner SA.
  • Risk: Test pod obtains full cluster privileges.
  • Action: Replace with least-privilege Role/RoleBinding scoped to required resources or use a dedicated short-lived bootstrapper with constrained rights.

CWE-434 (Unsafe file copy from pod via kubectl cp)

  • Code: kubectl cp "$E2E_NAMESPACE/$TEST_POD_NAME:/results/e2e-junit.xml" ...
  • Risk: Pod could create symlinks or manipulate files to overwrite host files.
  • Action: Ensure pod writes to a fresh, dedicated directory owned by non-root; validate file is regular, within size limits, and has expected content/type before placing on host.

CWE-532 (Sensitive Data in Logs)

  • Code: streaming pod logs from test container into CI output.
  • Risk: Secrets/PII may be exposed in CI logs.
  • Action: Scrub or redact known secret patterns, limit log retention, and avoid printing full logs by default; mark logs as sensitive in CI when possible.

CWE-327 / CWE-295 (TLS validation assumptions)

  • Code: curl -sS "https://$E2E_SIMULATOR_ENDPOINT/health"
  • Risk: Implicit trust in system CAs or accidental disabling of verification leads to MITM risk.
  • Action: Explicitly validate TLS (use --cacert or certificate pinning) or make insecure option explicit and gated behind a flag.

CWE-426 (Untrusted Search Path / tool resolution)

  • Code: use of readlink, dirname, cd without guarded path resolution.
  • Risk: If PATH utilities are tampered, resolution may be hijacked.
  • Action: Use fully qualified tool paths (resolve via command -v), or run in a controlled environment; validate resolved PROJECT_ROOT.

Only actionable issues are listed. No further findings.

🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically summarizes the main change: implementing E2E deployment and test execution in the e2e-ci.sh script, which is the core focus of the changeset.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 12, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
test/e2e/scripts/e2e-ci.sh (1)

346-346: 💤 Low value

Unquoted variable in exit statement.

Variable $TEST_EXIT should be quoted to prevent glob expansion (though unlikely to cause issues with numeric values, it violates defensive coding practices).

-exit $TEST_EXIT
+exit "$TEST_EXIT"

As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): Quote all variables to prevent injection."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` at line 346, The exit statement uses an unquoted
variable `TEST_EXIT` in the e2e-ci.sh script; change the exit invocation to
quote the variable (replace the unquoted `exit $TEST_EXIT` with a quoted form
such as `exit "$TEST_EXIT"`) to prevent shell globbing or word-splitting and
satisfy the SHELL SCRIPT SECURITY guideline; ensure any other uses of
`TEST_EXIT` in the script are similarly quoted.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/scripts/e2e-ci.sh`:
- Line 223: The script grants cluster-admin to the e2e-test-runner
ServiceAccount (search for the literal "name: cluster-admin" and references to
e2e-test-runner/E2E_JOB_NAME/E2E_NS); replace this with a namespace-scoped Role
and RoleBinding that grant only needed verbs/resources (pods, services,
configmaps, secrets, limited apps/deployments, and the maas.opendatahub.io
externalmodels rules) and bind that Role to the e2e-test-runner ServiceAccount,
and optionally add a separate readonly ClusterRole (e.g.,
${E2E_JOB_NAME}-readonly) only if cross-namespace GET/LIST access
(gateways/httproutes) is required; update the script to apply those
Role/RoleBinding/ClusterRole manifests instead of assigning cluster-admin to the
ServiceAccount.
- Line 119: Replace the direct kubectl apply from the external URL with a secure
flow: either vendor the CRD into the repo (e.g., add
test/e2e/manifests/externalmodel-crd.yaml and change the apply to use
"$SCRIPT_DIR/../manifests/externalmodel-crd.yaml") or fetch the remote CRD into
a temp file and verify its integrity before applying by computing and comparing
a known EXPECTED_SHA (use a CRD_URL variable, download to /tmp/crd.yaml, verify
sha256sum -c against EXPECTED_SHA, then kubectl apply -f /tmp/crd.yaml); update
the script around the current kubectl apply -f
"https://raw.githubusercontent.com/.../maas.opendatahub.io_externalmodels.yaml"
invocation accordingly.
- Around line 103-106: The curl health-check in e2e-ci.sh currently uses
--insecure and an unquoted ${E2E_SIMULATOR_ENDPOINT}; remove the --insecure flag
and replace it with a trusted CA option (e.g., use --cacert pointing to the
simulator CA bundle or system CA) and ensure the URL variable is quoted
("https://${E2E_SIMULATOR_ENDPOINT}/health") to prevent injection; update the
curl invocation (the health-check command using --insecure and --write-out) to
fail if certificate validation fails unless a CA bundle is provided via
environment or mounted path, and document/validate the CA path via an env
variable like SIMULATOR_CA if needed.
- Around line 262-267: The heredoc currently contains parameter expansion
`${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"}` which fails
because LABEL_FILTER_ARG is set to a YAML string earlier and not a shell CLI
fragment; fix by preparing a proper CLI argument variable before the heredoc
(e.g., build LABEL_FILTER_CLI or set LABEL_FILTER_ARG to either an empty string
or the exact `-ginkgo.label-filter="..."` token using E2E_LABEL_FILTER) and then
reference that single precomputed variable inside the heredoc/command; update
the e2e-ci.sh logic that sets LABEL_FILTER_ARG / E2E_LABEL_FILTER so the value
used in the exec line is a valid shell argument (no in-heredoc parameter
expansion of YAML content).
- Around line 320-322: The current extraction of TEST_EXIT_CODE via kubectl exec
with a fallback echo "1" masks infrastructure/read error as a test failure;
update the logic around TEST_EXIT_CODE and TEST_EXIT to detect kubectl execution
failure vs a real exit-code file: run the kubectl exec invocation and capture
its stdout and exit status separately (the invocation that reads
/results/exit-code inside the container), if kubectl itself fails or the file is
missing/unreadable set a distinct sentinel (e.g., TEST_EXIT_CODE="INFRA_FAILURE"
or numeric like 125) and emit a clear error log, otherwise parse the numeric
file into TEST_EXIT; ensure TEST_EXIT is set from that sentinel/number so
callers can differentiate infra errors from test failures (refer to the
variables TEST_EXIT_CODE, TEST_EXIT and the kubectl exec command in the diff).
- Around line 158-160: The current cut-based parsing of PAYLOAD_PROCESSING_IMAGE
into IMG_REGISTRY, IMG_REPO, and IMG_TAG is brittle (variables:
PAYLOAD_PROCESSING_IMAGE, IMG_REGISTRY, IMG_REPO, IMG_TAG) and fails for
registries with ports or multi-segment repos; instead stop splitting by blind
cut and either (A) pass the entire PAYLOAD_PROCESSING_IMAGE to Helm (e.g., set
upstreamBbr.bbr.image.fullReference) so the chart receives a single canonical
image string, or (B) implement robust parsing using a single
regex/parameter-expansion routine that extracts optional registry (with possible
:port), repository (including subpaths), and tag/digest and validates the format
before use; also add a defensive check that PAYLOAD_PROCESSING_IMAGE is
non-empty and abort with an error if parsing fails.

---

Nitpick comments:
In `@test/e2e/scripts/e2e-ci.sh`:
- Line 346: The exit statement uses an unquoted variable `TEST_EXIT` in the
e2e-ci.sh script; change the exit invocation to quote the variable (replace the
unquoted `exit $TEST_EXIT` with a quoted form such as `exit "$TEST_EXIT"`) to
prevent shell globbing or word-splitting and satisfy the SHELL SCRIPT SECURITY
guideline; ensure any other uses of `TEST_EXIT` in the script are similarly
quoted.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b99d7575-e9f3-47e7-8827-0ac59092112e

📥 Commits

Reviewing files that changed from the base of the PR and between 3ea8171 and c5b0e2c.

📒 Files selected for processing (1)
  • test/e2e/scripts/e2e-ci.sh

Comment on lines +103 to +106
if curl --silent --fail --insecure --max-time 10 --output /dev/null \
--write-out " Simulator: HTTP %{http_code} (%{time_total}s)\n" \
"https://${E2E_SIMULATOR_ENDPOINT}/health" 2>/dev/null; then
echo " Simulator: reachable"
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

CWE-295: TLS verification disabled on simulator health check.

The --insecure flag bypasses certificate validation, exposing the connection to MitM attacks. If the simulator uses self-signed certificates, pin the CA or use --cacert instead.

🔒 Proposed fix
-if curl --silent --fail --insecure --max-time 10 --output /dev/null \
+if curl --silent --fail --max-time 10 --output /dev/null \
     --write-out "  Simulator: HTTP %{http_code} (%{time_total}s)\n" \
     "https://${E2E_SIMULATOR_ENDPOINT}/health" 2>/dev/null; then

If self-signed certs are required:

# Option: Use --cacert with a mounted CA bundle
curl --cacert /path/to/simulator-ca.crt ...

As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): Quote all variables to prevent injection."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 103 - 106, The curl health-check in
e2e-ci.sh currently uses --insecure and an unquoted ${E2E_SIMULATOR_ENDPOINT};
remove the --insecure flag and replace it with a trusted CA option (e.g., use
--cacert pointing to the simulator CA bundle or system CA) and ensure the URL
variable is quoted ("https://${E2E_SIMULATOR_ENDPOINT}/health") to prevent
injection; update the curl invocation (the health-check command using --insecure
and --write-out) to fail if certificate validation fails unless a CA bundle is
provided via environment or mounted path, and document/validate the CA path via
an env variable like SIMULATOR_CA if needed.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intentional. The simulator uses Let's Encrypt certs for its sslip.io hostname. When accessed by IP (which is the default for E2E — the ginkgo tests use STATIC ServiceEntry with the IP), TLS cert validation fails because the cert doesn't match the raw IP. Same pattern used in the simulator check at the top of the original script.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yossiovadia, understood — the cert/IP mismatch is a real constraint when using the raw IP default.

That said, there's a narrower fix available: the preflight curl runs out-of-cluster (on the CI runner), so DNS resolution works fine there. You could use the sslip.io hostname only for the connectivity check while keeping the raw IP for the in-cluster ginkgo ServiceEntry:

# Derive sslip.io hostname from raw IP for the preflight TLS check
SIMULATOR_HEALTH_HOST="${E2E_SIMULATOR_ENDPOINT//./-}.sslip.io"
if curl --silent --fail --max-time 10 --output /dev/null \
    --write-out "  Simulator: HTTP %{http_code} (%{time_total}s)\n" \
    "https://${SIMULATOR_HEALTH_HOST}/health" 2>/dev/null; then

This separates concerns: TLS-validated reachability check (curl, out-of-cluster, hostname) vs. in-cluster routing (ginkgo, raw IP via ServiceEntry). At minimum, if you keep --insecure, add an inline comment so future reviewers don't flag it again:

# --insecure: cert is issued for sslip.io hostname; raw-IP access fails SNI match
if curl --silent --fail --insecure --max-time 10 ...

✏️ Learnings added
Learnt from: yossiovadia
Repo: opendatahub-io/ai-gateway-payload-processing PR: 266
File: test/e2e/scripts/e2e-ci.sh:116-119
Timestamp: 2026-05-12T19:51:17.390Z
Learning: In `test/e2e/scripts/e2e-ci.sh`, `--insecure` on the simulator curl health check is intentional. The simulator has a Let's Encrypt cert issued for its sslip.io hostname (e.g., 3-147-232-199.sslip.io), but `E2E_SIMULATOR_ENDPOINT` defaults to the raw IP (3.147.232.199). TLS cert validation fails on raw IP because the cert CN/SAN doesn't match. The ginkgo tests use a STATIC ServiceEntry with the raw IP for in-cluster routing, so the raw IP default is also intentional.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


if ! kubectl get crd externalmodels.maas.opendatahub.io >/dev/null 2>&1; then
echo " Installing ExternalModel CRD (maas.opendatahub.io)..."
kubectl apply -f https://raw.githubusercontent.com/opendatahub-io/models-as-a-service/refs/heads/main/deployment/base/maas-controller/crd/bases/maas.opendatahub.io_externalmodels.yaml
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

CWE-494: External CRD loaded without integrity verification.

Fetching CRDs from GitHub without checksum or signature validation allows supply-chain attacks. An attacker who compromises the upstream repo or performs MitM can inject malicious resource definitions.

🛡️ Mitigation options

Option 1: Vendor the CRD in-tree

# Store CRD at test/e2e/manifests/externalmodel-crd.yaml
kubectl apply -f "$SCRIPT_DIR/../manifests/externalmodel-crd.yaml"

Option 2: Verify with sha256sum

CRD_URL="https://raw.githubusercontent.com/opendatahub-io/models-as-a-service/refs/heads/main/deployment/base/maas-controller/crd/bases/maas.opendatahub.io_externalmodels.yaml"
EXPECTED_SHA="<commit_sha_or_content_hash>"
curl -fsSL "$CRD_URL" | tee /tmp/crd.yaml | sha256sum -c <(echo "$EXPECTED_SHA  -")
kubectl apply -f /tmp/crd.yaml

As per coding guidelines: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` at line 119, Replace the direct kubectl apply
from the external URL with a secure flow: either vendor the CRD into the repo
(e.g., add test/e2e/manifests/externalmodel-crd.yaml and change the apply to use
"$SCRIPT_DIR/../manifests/externalmodel-crd.yaml") or fetch the remote CRD into
a temp file and verify its integrity before applying by computing and comparing
a known EXPECTED_SHA (use a CRD_URL variable, download to /tmp/crd.yaml, verify
sha256sum -c against EXPECTED_SHA, then kubectl apply -f /tmp/crd.yaml); update
the script around the current kubectl apply -f
"https://raw.githubusercontent.com/.../maas.opendatahub.io_externalmodels.yaml"
invocation accordingly.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same pattern used in entrypoint.sh:45 and setup-kind.sh:86 — both fetch the same CRD from the same URL. Changing it here but not there would be inconsistent. Happy to address all three in a follow-up if we want to vendor the CRD.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify the external CRD fetch pattern in the other files mentioned
rg -n "kubectl apply -f https" --type sh

Length of output: 957


@yossiovadia Confirmed — the same unverified kubectl apply -f <raw_github_url> pattern appears in all three files:

File Line URL
test/e2e/scripts/entrypoint.sh 45 maas.opendatahub.io_externalmodels.yaml
test/e2e/scripts/setup-kind.sh 86 maas.opendatahub.io_externalmodels.yaml
test/e2e/scripts/e2e-ci.sh 132 maas.opendatahub.io_externalmodels.yaml

Also worth noting: setup-kind.sh:80 and e2e-ci.sh:93 apply the Gateway API CRDs from kubernetes-sigs/gateway-api the same way.

The consistency argument holds. Would you like me to open a follow-up GitHub issue to track vendoring (or sha256 pinning) across all three files for both CRD sources?

Comment on lines +158 to +160
IMG_REGISTRY="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d/ -f1)"
IMG_REPO="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d/ -f2- | cut -d: -f1)"
IMG_TAG="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d: -f2)"
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Brittle image parsing vulnerable to malformed input.

The cut-based parsing assumes registry/repository:tag format. Non-standard registries (e.g., localhost:5000/repo, gcr.io/project/subpath/image:tag) will produce incorrect values. If PAYLOAD_PROCESSING_IMAGE is externally supplied (Konflux pipeline), malformed input could break deployment or inject unintended registry values into Helm.

🔧 Robust parsing using regex or skopeo
-IMG_REGISTRY="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d/ -f1)"
-IMG_REPO="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d/ -f2- | cut -d: -f1)"
-IMG_TAG="$(echo "$PAYLOAD_PROCESSING_IMAGE" | cut -d: -f2)"
+# Parse image reference: registry/repository:tag
+if [[ "$PAYLOAD_PROCESSING_IMAGE" =~ ^([^/]+)/(.+):([^:]+)$ ]]; then
+    IMG_REGISTRY="${BASH_REMATCH[1]}"
+    IMG_REPO="${BASH_REMATCH[2]}"
+    IMG_TAG="${BASH_REMATCH[3]}"
+else
+    echo "ERROR: Invalid image format: $PAYLOAD_PROCESSING_IMAGE"
+    echo "Expected: registry/repository:tag"
+    exit 1
+fi

Alternative: use the full image reference directly if Helm chart supports it:

--set upstreamBbr.bbr.image.fullReference="$PAYLOAD_PROCESSING_IMAGE"

As per coding guidelines: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 158 - 160, The current cut-based
parsing of PAYLOAD_PROCESSING_IMAGE into IMG_REGISTRY, IMG_REPO, and IMG_TAG is
brittle (variables: PAYLOAD_PROCESSING_IMAGE, IMG_REGISTRY, IMG_REPO, IMG_TAG)
and fails for registries with ports or multi-segment repos; instead stop
splitting by blind cut and either (A) pass the entire PAYLOAD_PROCESSING_IMAGE
to Helm (e.g., set upstreamBbr.bbr.image.fullReference) so the chart receives a
single canonical image string, or (B) implement robust parsing using a single
regex/parameter-expansion routine that extracts optional registry (with possible
:port), repository (including subpaths), and tag/digest and validates the format
before use; also add a defensive check that PAYLOAD_PROCESSING_IMAGE is
non-empty and abort with an error if parsing fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our images are always quay.io/opendatahub/repo:tag — no ports, no digests. The Helm chart expects separate registry/repo/tag values, so we need to split. If the format ever changes, the Helm chart itself would need updating too. Not worth over-engineering for the current use case.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yossiovadia, understood — with a fixed quay.io/opendatahub/repo:tag convention and a Helm chart that's co-evolved to expect split values, the current parsing is fit for purpose. I'll drop the suggestion.


✏️ Learnings added
Learnt from: yossiovadia
Repo: opendatahub-io/ai-gateway-payload-processing PR: 266
File: test/e2e/scripts/e2e-ci.sh:171-173
Timestamp: 2026-05-12T19:51:56.100Z
Learning: In `test/e2e/scripts/e2e-ci.sh` (opendatahub-io/ai-gateway-payload-processing), PAYLOAD_PROCESSING_IMAGE always follows the fixed format `quay.io/opendatahub/<repo>:<tag>` — no port registries, no digests, no multi-segment sub-paths. The Helm chart (deploy/payload-processing) intentionally expects separate registry/repository/tag values via --set flags, so the cut-based image splitting (IMG_REGISTRY, IMG_REPO, IMG_TAG) is correct by design. Do not flag this as brittle parsing; if the image format ever changes, the Helm chart itself would need updating too.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-admin
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

CWE-269: Test pod granted cluster-admin privileges.

The e2e-test-runner ServiceAccount receives cluster-wide admin access. If the E2E test image is compromised or contains vulnerabilities, attackers gain full cluster control. E2E tests should use a scoped Role limited to the test namespace plus specific read permissions.

🔐 Least-privilege RBAC alternative
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ${E2E_JOB_NAME}
  namespace: ${E2E_NS}
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["maas.opendatahub.io"]
  resources: ["externalmodels"]
  verbs: ["get", "list", "watch", "create", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ${E2E_JOB_NAME}
  namespace: ${E2E_NS}
subjects:
- kind: ServiceAccount
  name: ${E2E_JOB_NAME}
  namespace: ${E2E_NS}
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ${E2E_JOB_NAME}
---
# If cross-namespace reads are required:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: ${E2E_JOB_NAME}-readonly
rules:
- apiGroups: ["gateway.networking.k8s.io"]
  resources: ["gateways", "httproutes"]
  verbs: ["get", "list"]

As per coding guidelines: "REVIEW PRIORITIES: 1. Security vulnerabilities (provide severity, exploit scenario, and remediation code)."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` at line 223, The script grants cluster-admin to
the e2e-test-runner ServiceAccount (search for the literal "name: cluster-admin"
and references to e2e-test-runner/E2E_JOB_NAME/E2E_NS); replace this with a
namespace-scoped Role and RoleBinding that grant only needed verbs/resources
(pods, services, configmaps, secrets, limited apps/deployments, and the
maas.opendatahub.io externalmodels rules) and bind that Role to the
e2e-test-runner ServiceAccount, and optionally add a separate readonly
ClusterRole (e.g., ${E2E_JOB_NAME}-readonly) only if cross-namespace GET/LIST
access (gateways/httproutes) is required; update the script to apply those
Role/RoleBinding/ClusterRole manifests instead of assigning cluster-admin to the
ServiceAccount.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The E2E test pod creates namespaces, CRDs, Services, HTTPRoutes, ExternalModels, ServiceEntries, DestinationRules across multiple namespaces. Scoping this precisely would be 30+ RBAC rules and fragile to maintain. This runs on an ephemeral cluster that gets destroyed after the test. cluster-admin is the standard pattern for E2E test runners.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yossiovadia, acknowledged. Given the scope (cross-namespace CRDs, ServiceEntries, DestinationRules, HTTPRoutes) and the ephemeral cluster context, the blast-radius argument holds. cluster-admin is reasonable here.


✏️ Learnings added
Learnt from: yossiovadia
Repo: opendatahub-io/ai-gateway-payload-processing PR: 266
File: test/e2e/scripts/e2e-ci.sh:236-236
Timestamp: 2026-05-12T19:52:33.177Z
Learning: In `test/e2e/scripts/e2e-ci.sh` (project: opendatahub-io/ai-gateway-payload-processing), the `e2e-test-runner` ServiceAccount is intentionally granted `cluster-admin`. The E2E tests create resources across multiple namespaces (namespaces, CRDs, Services, HTTPRoutes, ExternalModels, ServiceEntries, DestinationRules). Scoping RBAC precisely would require 30+ rules and be fragile. The tests run on an ephemeral cluster that is destroyed after the test run. Do not flag this as a security issue.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Comment on lines +262 to +267
/e2e/e2e-tests.test \
-test.v -ginkgo.v -test.count=1 \
-test.timeout=${E2E_TEST_TIMEOUT} \
-ginkgo.junit-report=/results/e2e-junit.xml \
${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"}
echo \$? > /results/exit-code
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

LABEL_FILTER_ARG expansion will not work in heredoc.

Line 266 uses ${LABEL_FILTER_ARG:+...} parameter expansion inside a heredoc, but the variable is set to a YAML string (line 236), not a shell argument. This will either produce malformed YAML or fail to apply the label filter.

🐛 Corrected argument handling
-LABEL_FILTER_ARG=""
-if [[ -n "$E2E_LABEL_FILTER" ]]; then
-    LABEL_FILTER_ARG="        - \"-ginkgo.label-filter=$E2E_LABEL_FILTER\""
-fi
-
 # Run tests as a Pod (not Job) — the init container runs the tests and writes
 # the JUnit report to a shared volume. The main container stays alive so we
 # can extract the report via kubectl cp before cleanup.
 kubectl delete pod "$E2E_JOB_NAME" -n "$E2E_NS" --ignore-not-found 2>/dev/null || true

+# Build label filter flag for test command
+LABEL_FILTER_FLAG=""
+if [[ -n "$E2E_LABEL_FILTER" ]]; then
+    LABEL_FILTER_FLAG="-ginkgo.label-filter=${E2E_LABEL_FILTER}"
+fi
+
 cat <<EOF | kubectl apply -f -
 apiVersion: v1
 kind: Pod
@@ -263,8 +265,7 @@
       /e2e/e2e-tests.test \
         -test.v -ginkgo.v -test.count=1 \
         -test.timeout=${E2E_TEST_TIMEOUT} \
-        -ginkgo.junit-report=/results/e2e-junit.xml \
-        ${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"}
+        -ginkgo.junit-report=/results/e2e-junit.xml ${LABEL_FILTER_FLAG}
       echo \$? > /results/exit-code
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 262 - 267, The heredoc currently
contains parameter expansion
`${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"}` which fails
because LABEL_FILTER_ARG is set to a YAML string earlier and not a shell CLI
fragment; fix by preparing a proper CLI argument variable before the heredoc
(e.g., build LABEL_FILTER_CLI or set LABEL_FILTER_ARG to either an empty string
or the exact `-ginkgo.label-filter="..."` token using E2E_LABEL_FILTER) and then
reference that single precomputed variable inside the heredoc/command; update
the e2e-ci.sh logic that sets LABEL_FILTER_ARG / E2E_LABEL_FILTER so the value
used in the exec line is a valid shell argument (no in-heredoc parameter
expansion of YAML content).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The init container uses command: ["/bin/sh", "-c"] with the args as a shell script string. The ${LABEL_FILTER_ARG:+...} expansion happens when the heredoc is evaluated by the outer shell (before kubectl apply), not inside the container. It correctly produces either the flag or nothing. Tested locally with and without the filter set.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yossiovadia, you're correct — I was wrong. The heredoc is unquoted (<<EOF), so the outer shell evaluates ${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"} before the YAML reaches kubectl apply. Since ${VAR:+alt} expands to alt (not $VAR) when VAR is non-empty, the value of LABEL_FILTER_ARG being a YAML fragment is irrelevant — it's only used as a truthiness gate. The correct flag string ends up inline in the /bin/sh -c script string in both cases. Apologies for the noise.

Comment on lines +320 to +322
TEST_EXIT_CODE=$(kubectl exec "$E2E_JOB_NAME" -n "$E2E_NS" -c results \
-- cat /results/exit-code 2>/dev/null || echo "1")
TEST_EXIT=${TEST_EXIT_CODE:-1}
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Exit code extraction may fail silently.

If /results/exit-code is missing or unreadable, the fallback echo "1" masks the failure mode. The script should distinguish between test failure and infrastructure failure.

💡 Improved error handling
 # Check test exit code
-TEST_EXIT_CODE=$(kubectl exec "$E2E_JOB_NAME" -n "$E2E_NS" -c results \
-    -- cat /results/exit-code 2>/dev/null || echo "1")
-TEST_EXIT=${TEST_EXIT_CODE:-1}
+if ! TEST_EXIT_CODE=$(kubectl exec "$E2E_JOB_NAME" -n "$E2E_NS" -c results \
+    -- cat /results/exit-code 2>/dev/null); then
+    echo "  ERROR: Could not read test exit code from pod"
+    TEST_EXIT=1
+else
+    TEST_EXIT="${TEST_EXIT_CODE:-1}"
+fi

As per coding guidelines: "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps."

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
TEST_EXIT_CODE=$(kubectl exec "$E2E_JOB_NAME" -n "$E2E_NS" -c results \
-- cat /results/exit-code 2>/dev/null || echo "1")
TEST_EXIT=${TEST_EXIT_CODE:-1}
if ! TEST_EXIT_CODE=$(kubectl exec "$E2E_JOB_NAME" -n "$E2E_NS" -c results \
-- cat /results/exit-code 2>/dev/null); then
echo " ERROR: Could not read test exit code from pod"
TEST_EXIT=1
else
TEST_EXIT="${TEST_EXIT_CODE:-1}"
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 320 - 322, The current extraction of
TEST_EXIT_CODE via kubectl exec with a fallback echo "1" masks
infrastructure/read error as a test failure; update the logic around
TEST_EXIT_CODE and TEST_EXIT to detect kubectl execution failure vs a real
exit-code file: run the kubectl exec invocation and capture its stdout and exit
status separately (the invocation that reads /results/exit-code inside the
container), if kubectl itself fails or the file is missing/unreadable set a
distinct sentinel (e.g., TEST_EXIT_CODE="INFRA_FAILURE" or numeric like 125) and
emit a clear error log, otherwise parse the numeric file into TEST_EXIT; ensure
TEST_EXIT is set from that sentinel/number so callers can differentiate infra
errors from test failures (refer to the variables TEST_EXIT_CODE, TEST_EXIT and
the kubectl exec command in the diff).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If kubectl can't exec into the pod, the fallback is exit code 1 which correctly fails the pipeline. Distinguishing 'test failed' from 'infra failed' is nice but the pipeline treats both the same way — fail and collect must-gather. Not worth the added complexity for a test script.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yossiovadia — fair point. If the pipeline's response to both failure modes is identical, adding the distinction adds complexity without changing behavior. Acknowledged.

The Konflux ephemeral cluster doesn't come with Istio or Gateway API
pre-installed. Instead of failing, install them automatically.

Uses upstream istioctl with inference extension flags, matching the
existing setup-kind.sh pattern.
The curl pod created by the ginkgo BeforeSuite fails to start with
sidecar injection enabled on the ephemeral cluster. Test traffic goes
through the Gateway, not sidecar mTLS, so injection is not needed.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (1)
test/e2e/scripts/e2e-ci.sh (1)

194-200: ⚡ Quick win

Silent kubectl patch failure can mask sidecar-injection regressions.

The || true after the patch swallows every error class — RBAC denials, API server transient errors, a renamed deployment in the chart. If the patch silently fails, the BBR pod comes up with the Istio sidecar injected, and ext_proc's self-signed TLS (per the inline comment) is suddenly subject to Istio mTLS, producing opaque test failures hundreds of lines later in CI logs.

At minimum, check whether the annotation was actually applied; if the chart already exposes a value for the annotation, prefer setting it via --set so it's part of the rendered manifest and not a post-hoc patch.

🔧 Fail loudly instead
-# Disable Istio sidecar on BBR pod (ext_proc uses self-signed TLS)
-kubectl patch deployment payload-processing -n "$GATEWAY_NAMESPACE" --type=merge \
-    -p='{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}' \
-    2>/dev/null || true
+# Disable Istio sidecar on BBR pod (ext_proc uses self-signed TLS)
+if ! kubectl patch deployment payload-processing -n "$GATEWAY_NAMESPACE" --type=merge \
+    -p='{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/inject":"false"}}}}}'; then
+    echo "ERROR: failed to disable Istio sidecar injection on payload-processing"
+    exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 194 - 200, Remove the silent failure
on the kubectl patch and ensure the annotation was actually applied: stop
swallowing errors from the kubectl patch command that targets the
payload-processing deployment (the line invoking "kubectl patch deployment
payload-processing" with GATEWAY_NAMESPACE), and after patching verify the pod
template annotation "sidecar.istio.io/inject":"false" is present (e.g., by
querying the deployment or pod metadata); if the annotation is missing or the
patch failed, exit non‑zero so CI fails fast. Prefer setting the annotation via
chart templating/--set during deployment rather than a post‑deploy patch when
possible, and keep the existing "kubectl rollout status
deployment/payload-processing -n \"$GATEWAY_NAMESPACE\"" check to ensure rollout
completes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/scripts/e2e-ci.sh`:
- Around line 65-78: The EXIT trap currently calls cleanup which can overwrite
the script's original exit status; update the trap so the original status is
captured and re-used (e.g. capture $? into a temp var before running cleanup and
exit with it), or change cleanup to accept the status as an argument and have
the trap call cleanup "$?" and then exit with that status; target the cleanup
function and the trap line (trap cleanup EXIT) and ensure any early exits under
set -e preserve the original exit code when the trap finishes.
- Around line 102-104: The script currently uses $PWD when downloading and
exporting istioctl, which makes the installer path non-deterministic; instead,
create a dedicated deterministic install directory (e.g., a temp or repo-local
dir derived from ISTIO_VERSION like INSTALL_DIR="$(mktemp
-d)/istio-${ISTIO_VERSION}" or a fixed CI artifact dir), pass that directory to
the Istio installer so extraction happens there, and update the PATH export to
export INSTALL_DIR/bin rather than $PWD/istio-${ISTIO_VERSION}/bin; ensure
cleanup or reuse policy is handled and reference ISTIO_VERSION, istioctl, the
curl installer invocation, and the export PATH line when making the change.
- Around line 99-110: The script currently executes the remote installer via
"curl -sL https://istio.io/downloadIstio | ISTIO_VERSION=\"$ISTIO_VERSION\" sh
-" which allows remote code execution; replace this with a safe
download-and-verify flow or bake istioctl into the CI image. Concretely: instead
of piping to sh, download the installer archive to a temp file using curl or
wget, download the corresponding checksum (and signature if available), verify
the archive with sha256sum (and gpg/openssl verify if signature present), only
then extract and add the extracted istio-<ISTIO_VERSION>/bin to PATH before
calling istioctl install; alternatively update the CI image to include istioctl
(so the "if ! command -v istioctl" branch no longer downloads anything). Ensure
you update the code that currently references ISTIO_VERSION and the curl
pipeline so the new verified-download or baked-image approach replaces the
insecure pipeline.

---

Nitpick comments:
In `@test/e2e/scripts/e2e-ci.sh`:
- Around line 194-200: Remove the silent failure on the kubectl patch and ensure
the annotation was actually applied: stop swallowing errors from the kubectl
patch command that targets the payload-processing deployment (the line invoking
"kubectl patch deployment payload-processing" with GATEWAY_NAMESPACE), and after
patching verify the pod template annotation "sidecar.istio.io/inject":"false" is
present (e.g., by querying the deployment or pod metadata); if the annotation is
missing or the patch failed, exit non‑zero so CI fails fast. Prefer setting the
annotation via chart templating/--set during deployment rather than a
post‑deploy patch when possible, and keep the existing "kubectl rollout status
deployment/payload-processing -n \"$GATEWAY_NAMESPACE\"" check to ensure rollout
completes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 562e44a1-4f72-43b3-bda0-98b03be26068

📥 Commits

Reviewing files that changed from the base of the PR and between c5b0e2c and 0614a6c.

📒 Files selected for processing (1)
  • test/e2e/scripts/e2e-ci.sh

Comment on lines +65 to +78
cleanup() {
if [[ "$E2E_CLEANUP" != "true" ]]; then
echo "Skipping cleanup (E2E_CLEANUP=$E2E_CLEANUP)"
return
fi
echo ""
echo "=== Cleanup ==="
kubectl delete job "$E2E_JOB_NAME" -n "$E2E_NS" --ignore-not-found 2>/dev/null || true
helm uninstall "$HELM_RELEASE" -n "$GATEWAY_NAMESPACE" 2>/dev/null || true
kubectl delete namespace "$E2E_NS" --ignore-not-found --timeout=60s 2>/dev/null || true
kubectl delete gateway "$GATEWAY_NAME" -n "$GATEWAY_NAMESPACE" --ignore-not-found 2>/dev/null || true
echo "Cleanup complete"
}
trap cleanup EXIT
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

trap cleanup EXIT clobbers the failing exit code.

If any command between line 80 and line 359 fails under set -e, the EXIT trap fires and cleanup runs kubectl/helm commands whose own exit codes overwrite $?. The script then exits with the trap's last status, not the original failure — so a preflight failure can surface to Konflux as exit 0 if every cleanup step happens to succeed, masking a real CI failure.

🔧 Preserve original exit status
-cleanup() {
+cleanup() {
+    local rc=$?
     if [[ "$E2E_CLEANUP" != "true" ]]; then
         echo "Skipping cleanup (E2E_CLEANUP=$E2E_CLEANUP)"
-        return
+        return "$rc"
     fi
     echo ""
     echo "=== Cleanup ==="
     kubectl delete job "$E2E_JOB_NAME" -n "$E2E_NS" --ignore-not-found 2>/dev/null || true
     helm uninstall "$HELM_RELEASE" -n "$GATEWAY_NAMESPACE" 2>/dev/null || true
     kubectl delete namespace "$E2E_NS" --ignore-not-found --timeout=60s 2>/dev/null || true
     kubectl delete gateway "$GATEWAY_NAME" -n "$GATEWAY_NAMESPACE" --ignore-not-found 2>/dev/null || true
     echo "Cleanup complete"
+    return "$rc"
 }
 trap cleanup EXIT

Note: the explicit exit "$TEST_EXIT" on line 359 is unaffected (bash forwards the argument to the trap via $?), but interim exit 1 paths (lines 87, 122) currently rely on the trap not stepping on them.

🧰 Tools
🪛 Shellcheck (0.11.0)

[info] 65-77: This function is never invoked. Check usage (or ignored if invoked indirectly).

(SC2329)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 65 - 78, The EXIT trap currently
calls cleanup which can overwrite the script's original exit status; update the
trap so the original status is captured and re-used (e.g. capture $? into a temp
var before running cleanup and exit with it), or change cleanup to accept the
status as an argument and have the trap call cleanup "$?" and then exit with
that status; target the cleanup function and the trap line (trap cleanup EXIT)
and ensure any early exits under set -e preserve the original exit code when the
trap finishes.

Comment on lines +99 to +110
if ! kubectl get deployment istiod -n istio-system >/dev/null 2>&1; then
echo " Istio not found, installing ${ISTIO_VERSION}..."
if ! command -v istioctl &>/dev/null; then
curl -sL https://istio.io/downloadIstio | ISTIO_VERSION="$ISTIO_VERSION" sh -
export PATH="$PWD/istio-${ISTIO_VERSION}/bin:$PATH"
fi
istioctl install --set profile=minimal \
--set values.pilot.env.SUPPORT_GATEWAY_API_INFERENCE_EXTENSION=true \
--set values.pilot.env.ENABLE_GATEWAY_API_INFERENCE_EXTENSION=true \
-y
kubectl rollout status deployment/istiod -n istio-system --timeout=120s
echo " Istio: installed"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CWE-829: Remote code execution via curl | sh without integrity verification.

Line 102 pipes the output of https://istio.io/downloadIstio directly into sh on the CI runner. A compromised istio.io endpoint, MitM on the TLS connection (cert chain only, no pin), or DNS hijack would result in arbitrary code execution with the CI runner's privileges — leaking secrets, the kubeconfig of the ephemeral cluster, and pipeline credentials. This is materially different from the kubectl apply -f <url> cases already discussed: those apply YAML to an ephemeral cluster, this executes shell on the host runner.

Two viable remediations:

🛡️ Option 1 — pin the installer by checksum
-    if ! command -v istioctl &>/dev/null; then
-        curl -sL https://istio.io/downloadIstio | ISTIO_VERSION="$ISTIO_VERSION" sh -
-        export PATH="$PWD/istio-${ISTIO_VERSION}/bin:$PATH"
-    fi
+    if ! command -v istioctl &>/dev/null; then
+        ISTIO_DIR="$(mktemp -d)"
+        ISTIO_TARBALL="istioctl-${ISTIO_VERSION}-linux-amd64.tar.gz"
+        ISTIO_SHA256="<pin the published sha256 for this version>"
+        curl -fsSL -o "${ISTIO_DIR}/${ISTIO_TARBALL}" \
+            "https://github.com/istio/istio/releases/download/${ISTIO_VERSION}/${ISTIO_TARBALL}"
+        echo "${ISTIO_SHA256}  ${ISTIO_DIR}/${ISTIO_TARBALL}" | sha256sum -c -
+        tar -xzf "${ISTIO_DIR}/${ISTIO_TARBALL}" -C "${ISTIO_DIR}"
+        export PATH="${ISTIO_DIR}:$PATH"
+    fi
🛡️ Option 2 — bake `istioctl` into the CI image

Install istioctl in the runner image at build time (with checksum verification there) and drop this download path entirely. Cleanest option if the Konflux runner image is under your control.

As per coding guidelines: "SHELL SCRIPT SECURITY (CWE-78): … Avoid eval and dynamic command execution."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 99 - 110, The script currently
executes the remote installer via "curl -sL https://istio.io/downloadIstio |
ISTIO_VERSION=\"$ISTIO_VERSION\" sh -" which allows remote code execution;
replace this with a safe download-and-verify flow or bake istioctl into the CI
image. Concretely: instead of piping to sh, download the installer archive to a
temp file using curl or wget, download the corresponding checksum (and signature
if available), verify the archive with sha256sum (and gpg/openssl verify if
signature present), only then extract and add the extracted
istio-<ISTIO_VERSION>/bin to PATH before calling istioctl install; alternatively
update the CI image to include istioctl (so the "if ! command -v istioctl"
branch no longer downloads anything). Ensure you update the code that currently
references ISTIO_VERSION and the curl pipeline so the new verified-download or
baked-image approach replaces the insecure pipeline.

Comment on lines +102 to +104
curl -sL https://istio.io/downloadIstio | ISTIO_VERSION="$ISTIO_VERSION" sh -
export PATH="$PWD/istio-${ISTIO_VERSION}/bin:$PATH"
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

$PWD makes istioctl location non-deterministic.

The Istio installer extracts istio-${ISTIO_VERSION}/ into the current working directory when the script is invoked, then line 103 adds $PWD/istio-${ISTIO_VERSION}/bin to PATH. If the script is ever invoked from a different CWD (a common Konflux pattern when task definitions cd elsewhere), the export will reference a directory the installer didn't populate, and a subsequent istioctl install will fail with command not found — or worse, pick up a stale binary from a prior run that happened to land in $PWD. It also litters the caller's CWD with an istio-* tree.

🔧 Suggested fix
     if ! command -v istioctl &>/dev/null; then
-        curl -sL https://istio.io/downloadIstio | ISTIO_VERSION="$ISTIO_VERSION" sh -
-        export PATH="$PWD/istio-${ISTIO_VERSION}/bin:$PATH"
+        ISTIO_INSTALL_DIR="$(mktemp -d)"
+        ( cd "$ISTIO_INSTALL_DIR" && curl -sL https://istio.io/downloadIstio | ISTIO_VERSION="$ISTIO_VERSION" sh - )
+        export PATH="${ISTIO_INSTALL_DIR}/istio-${ISTIO_VERSION}/bin:$PATH"
     fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 102 - 104, The script currently uses
$PWD when downloading and exporting istioctl, which makes the installer path
non-deterministic; instead, create a dedicated deterministic install directory
(e.g., a temp or repo-local dir derived from ISTIO_VERSION like
INSTALL_DIR="$(mktemp -d)/istio-${ISTIO_VERSION}" or a fixed CI artifact dir),
pass that directory to the Istio installer so extraction happens there, and
update the PATH export to export INSTALL_DIR/bin rather than
$PWD/istio-${ISTIO_VERSION}/bin; ensure cleanup or reuse policy is handled and
reference ISTIO_VERSION, istioctl, the curl installer invocation, and the export
PATH line when making the change.

The simulator has key validation enabled — invalid/missing keys
return 401. Set E2E_SIMULATOR_VALIDATE_KEYS=true so the ginkgo
tier2 test runs instead of being skipped.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/scripts/e2e-ci.sh`:
- Line 47: The script currently hard-codes a 600s pod wait which can abort while
E2E_TEST_TIMEOUT is longer; change the pod-wait logic to derive its timeout from
the E2E_TEST_TIMEOUT variable (parse it into seconds) and use that value plus a
small buffer (e.g., +30s) instead of the literal 600; update the code paths that
reference the fixed 600 (the pod wait loop / kubectl wait invocation) to use the
computed POD_WAIT_SECS (or similar) so the init container/outer wait honors
E2E_TEST_TIMEOUT.
- Around line 41-50: The script uses static resource names (HELM_RELEASE,
E2E_JOB_NAME, E2E_NS, GATEWAY_NAME) which cause collisions in concurrent runs;
change their defaults to include a unique run suffix (e.g., from CI_BUILD_ID, a
timestamp, or a generated short UUID) and propagate that suffix everywhere these
variables are referenced so every created resource (Helm release, job,
namespace, gateway name) is namespaced per-run; ensure the cleanup logic and any
lookups use the same suffixed variables so one run cannot delete or overwrite
another run's resources.
- Around line 246-249: LABEL_FILTER_ARG injects E2E_LABEL_FILTER unquoted into a
shell command, allowing shell interpretation; change the assignment for
LABEL_FILTER_ARG to produce a properly quoted/escaped ginkgo argument (e.g. emit
-ginkgo.label-filter='VALUE' and escape any single quotes in E2E_LABEL_FILTER by
replacing ' with '\'' or use printf '%q' to safely escape the variable), and
apply the same quoting/escaping fix to the other occurrence that builds the init
container command; ensure the resulting string supplies the filter as a single
literal argument to avoid shell expansion.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 943e4327-17e6-4a21-a4e5-119e60d995b3

📥 Commits

Reviewing files that changed from the base of the PR and between 0614a6c and 7660c73.

📒 Files selected for processing (1)
  • test/e2e/scripts/e2e-ci.sh

Comment on lines +41 to +50
GATEWAY_NAMESPACE="${E2E_GATEWAY_NAMESPACE:-istio-system}"
GATEWAY_NAME="${E2E_GATEWAY_NAME:-e2e-gateway}"
E2E_NS="${E2E_NS:-e2e-models}"
E2E_LABEL_FILTER="${E2E_LABEL_FILTER:-}"
E2E_JUNIT_REPORT="${E2E_JUNIT_REPORT:-results/e2e-junit.xml}"
E2E_CLEANUP="${E2E_CLEANUP:-true}"
E2E_TEST_TIMEOUT="${E2E_TEST_TIMEOUT:-10m}"

HELM_RELEASE="payload-processing-e2e"
E2E_JOB_NAME="e2e-test-runner"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Static resource names will make concurrent E2E runs stomp each other.

payload-processing-e2e, e2e-test-runner, e2e-models, and the default e2e-gateway are global names. On a shared Konflux/OpenShift cluster, one run can overwrite another run’s deploys or delete its resources during cleanup.

Suggested fix
-GATEWAY_NAME="${E2E_GATEWAY_NAME:-e2e-gateway}"
-E2E_NS="${E2E_NS:-e2e-models}"
+RUN_SUFFIX_RAW="${E2E_RUN_SUFFIX:-${RANDOM}}"
+RUN_SUFFIX="$(printf '%s' "$RUN_SUFFIX_RAW" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9-' '-')"
+GATEWAY_NAME="${E2E_GATEWAY_NAME:-e2e-gateway-${RUN_SUFFIX}}"
+E2E_NS="${E2E_NS:-e2e-models-${RUN_SUFFIX}}"
 ...
-HELM_RELEASE="payload-processing-e2e"
-E2E_JOB_NAME="e2e-test-runner"
+HELM_RELEASE="${HELM_RELEASE:-payload-processing-e2e-${RUN_SUFFIX}}"
+E2E_JOB_NAME="${E2E_JOB_NAME:-e2e-test-runner-${RUN_SUFFIX}}"

As per coding guidelines, "REVIEW PRIORITIES: 2. Architectural issues and anti-patterns".

Also applies to: 72-75

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 41 - 50, The script uses static
resource names (HELM_RELEASE, E2E_JOB_NAME, E2E_NS, GATEWAY_NAME) which cause
collisions in concurrent runs; change their defaults to include a unique run
suffix (e.g., from CI_BUILD_ID, a timestamp, or a generated short UUID) and
propagate that suffix everywhere these variables are referenced so every created
resource (Helm release, job, namespace, gateway name) is namespaced per-run;
ensure the cleanup logic and any lookups use the same suffixed variables so one
run cannot delete or overwrite another run's resources.

E2E_LABEL_FILTER="${E2E_LABEL_FILTER:-}"
E2E_JUNIT_REPORT="${E2E_JUNIT_REPORT:-results/e2e-junit.xml}"
E2E_CLEANUP="${E2E_CLEANUP:-true}"
E2E_TEST_TIMEOUT="${E2E_TEST_TIMEOUT:-10m}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The hard-coded 600s pod wait makes E2E_TEST_TIMEOUT lie.

The init container is allowed to run for ${E2E_TEST_TIMEOUT}, but the outer wait always gives up after 10 minutes. Any run with a larger timeout, or one that uses most of the default budget plus scheduling time, can fall through to the exit-code probe while the test is still running and get reported as failed.

Suggested fix
 E2E_TEST_TIMEOUT="${E2E_TEST_TIMEOUT:-10m}"
+E2E_POD_WAIT_TIMEOUT="${E2E_POD_WAIT_TIMEOUT:-15m}"
 ...
 if ! kubectl wait --for=condition=Ready "pod/${E2E_JOB_NAME}" \
-    -n "$E2E_NS" --timeout=600s 2>/dev/null; then
+    -n "$E2E_NS" --timeout="$E2E_POD_WAIT_TIMEOUT" 2>/dev/null; then

At minimum, the wait should track E2E_TEST_TIMEOUT; ideally it should add a small scheduling buffer.

As per coding guidelines, "REVIEW PRIORITIES: 3. Bug-prone patterns and error handling gaps."

Also applies to: 311-336

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` at line 47, The script currently hard-codes a
600s pod wait which can abort while E2E_TEST_TIMEOUT is longer; change the
pod-wait logic to derive its timeout from the E2E_TEST_TIMEOUT variable (parse
it into seconds) and use that value plus a small buffer (e.g., +30s) instead of
the literal 600; update the code paths that reference the fixed 600 (the pod
wait loop / kubectl wait invocation) to use the computed POD_WAIT_SECS (or
similar) so the init container/outer wait honors E2E_TEST_TIMEOUT.

Comment on lines +246 to +249
LABEL_FILTER_ARG=""
if [[ -n "$E2E_LABEL_FILTER" ]]; then
LABEL_FILTER_ARG=" - \"-ginkgo.label-filter=$E2E_LABEL_FILTER\""
fi
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

CWE-78: E2E_LABEL_FILTER is shell-injected into the init container command.

-ginkgo.label-filter=${E2E_LABEL_FILTER} is emitted into /bin/sh -c unquoted. A normal filter like tier2 && !slow will be parsed by the shell instead of Ginkgo, and a crafted value can execute arbitrary commands inside a pod that has cluster-wide privileges.

Suggested fix
-LABEL_FILTER_ARG=""
-if [[ -n "$E2E_LABEL_FILTER" ]]; then
-    LABEL_FILTER_ARG="        - \"-ginkgo.label-filter=$E2E_LABEL_FILTER\""
-fi
-
 ...
     args:
     - |
-      /e2e/e2e-tests.test \
-        -test.v -ginkgo.v -test.count=1 \
-        -test.timeout=${E2E_TEST_TIMEOUT} \
-        -ginkgo.junit-report=/results/e2e-junit.xml \
-        ${LABEL_FILTER_ARG:+"-ginkgo.label-filter=${E2E_LABEL_FILTER}"}
-      echo \$? > /results/exit-code
+      status=0
+      if [ -n "\${E2E_LABEL_FILTER}" ]; then
+        /e2e/e2e-tests.test \
+          -test.v -ginkgo.v -test.count=1 \
+          -test.timeout=${E2E_TEST_TIMEOUT} \
+          -ginkgo.junit-report=/results/e2e-junit.xml \
+          "-ginkgo.label-filter=\${E2E_LABEL_FILTER}" || status=\$?
+      else
+        /e2e/e2e-tests.test \
+          -test.v -ginkgo.v -test.count=1 \
+          -test.timeout=${E2E_TEST_TIMEOUT} \
+          -ginkgo.junit-report=/results/e2e-junit.xml || status=\$?
+      fi
+      echo "\$status" > /results/exit-code

As per coding guidelines, "SHELL SCRIPT SECURITY (CWE-78): Quote all variables to prevent injection."

Also applies to: 271-279

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/e2e/scripts/e2e-ci.sh` around lines 246 - 249, LABEL_FILTER_ARG injects
E2E_LABEL_FILTER unquoted into a shell command, allowing shell interpretation;
change the assignment for LABEL_FILTER_ARG to produce a properly quoted/escaped
ginkgo argument (e.g. emit -ginkgo.label-filter='VALUE' and escape any single
quotes in E2E_LABEL_FILTER by replacing ' with '\'' or use printf '%q' to safely
escape the variable), and apply the same quoting/escaping fix to the other
occurrence that builds the init container command; ensure the resulting string
supplies the filter as a single literal argument to avoid shell expansion.

@yossiovadia
Copy link
Copy Markdown
Contributor Author

yossiovadia commented May 13, 2026

what it does -

  1. Smoke tests (tier1) — sends a chat completion request to each provider type (OpenAI, Anthropic, Azure, Bedrock, Vertex) through the full gateway stack and expects HTTP 200
  2. Response format (tier1) — verifies the response is valid OpenAI format with choices array and model field for each provider
  3. API key validation (tier2, skipped) — sends invalid API key, expects 401.

So it's testing the IPP plugin chain end-to-end: request comes in → body-field-to-header extracts model name → model-provider-resolver looks up the ExternalModel → api-translation converts the request format → apikey-injection injects the provider's API key → request goes to llm-katan → response comes back through api-translation → client gets OpenAI format response.

It validates that the whole pipeline works — routing, translation, credential injection — for all 5 provider types. Real HTTP traffic through Envoy + ext-proc, not mocks.

Implements https://redhat.atlassian.net/browse/RHOAIENG-62239

@rhods-ci-bot
Copy link
Copy Markdown

@yossiovadia: The following test has Succeeded:

OCI Artifact Browser URL

View in Artifact Browser

Inspecting Test Artifacts Manually

To inspect your test artifacts manually, follow these steps:

  1. Install ORAS (see the ORAS installation guide).
  2. Download artifacts with the following commands:
mkdir -p oras-artifacts
cd oras-artifacts
oras pull quay.io/opendatahub/odh-ci-artifacts:ai-gateway-group-test-7dcm8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants