Commit 154f2f1

are-ces, Radovan Fuchs, and claude authored
LCORE-1497: Fix RHOAI Prow e2e pipeline failures (#1613)

* bump rhoai image version in prow tests

* Fix missing closing brace in LLAMA_STACK_IMAGE parameter expansion

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix prow pipeline: in-cluster image build, RAG config, port-forward fix

  - Build llama-stack image in OpenShift internal registry via oc new-build/start-build
  - Add image-puller role for default SA to pull from internal registry
  - Add FAISS_VECTOR_STORE_ID and KV_RAG_PATH env vars to lightspeed-stack pod
  - Add inference, byok_rag, and rag sections to prow lightspeed-stack configs
  - Use envsubst with specific variable scoping in pipeline-services.sh
  - Fix free_local_tcp_port to only kill LISTEN sockets (was killing behave process)
  - Add MCP token secrets and empty OpenAI secret to pipeline.sh
  - Add rlsapi_v1_infer action to prow RBAC config
  - Simplify llama-stack.yaml to use pre-built image

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add namespace diagnostic logging to prow pipeline

  Add DEBUG NS checkpoints to trace when e2e-rhoai-dsc namespace
  disappears during operator bootstrapping.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Remove namespace from cluster-scoped DataScienceCluster CR

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Add model/provider override env vars to prow pipeline

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix prow pipeline: run bootstrap before namespace creation

  The RHOAI operator deletes the e2e-rhoai-dsc namespace during DSC
  reconciliation. Reorder pipeline to run operator bootstrap first,
  then create namespace and secrets after DSC settles.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Hardcode NAMESPACE to avoid Prow env override

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Re-enable secret exports in prow pipeline

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Add enrichment and RAG restore to prow llama-stack manifest

  Rename llama-stack.yaml to llama-stack-prow.yaml and add:
  - Config enrichment via llama_stack_configuration.py
  - restore_rag_seed() to re-inflate RAG db after enrichment
  - PYTHONPATH, lightspeed-stack.yaml mount, rag-data mount
  - materialize-run-yaml init container
  - Model/provider overrides in inline_rag e2e tests

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Fix prow e2e pipeline: secret ordering, config path, and test runner

  - Create llama-stack-ip-secret before deploying the pod to fix
    chicken-and-egg dependency where the pod requires the secret as a
    non-optional env var
  - Add LLAMA_STACK_CONFIG env var pointing to the correct emptyDir mount
    path where materialize-run-yaml init container places run.yaml
  - Use make test-e2e-local instead of test-e2e to avoid
    macOS-incompatible script -c flag
  - Remove DEBUG NS echo lines from pipeline scripts

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix e2e-ops restart failures and mock-jwks port-forward

  - Replace unfiltered envsubst with sed in e2e-ops.sh restart commands
    to prevent blanking $VAR references in embedded bash scripts
  - Add mock-jwks port-forward management (kill/restart/health check)
    so RBAC and MCP tests don't fail with connection refused on :8000
  - Restart mock-jwks port-forward as part of lightspeed restart
  - Increase vLLM max-model-len from 2048 to 32768 to avoid context
    length errors with RAG queries

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Improve port-forward resilience for Prow E2E tests

  verify_connectivity now checks /v1/models returns 200 (not just
  /readiness) to ensure the app is fully initialized before declaring
  success. before_scenario in the test framework probes the port-forward
  before each scenario and auto-restarts it via e2e-ops if dead.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix update-configmap cascade failure and surface oc errors

  Replace fragile oc delete + oc create with oc create --dry-run | oc apply
  so a failed update leaves the ConfigMap intact instead of deleted. The
  old approach caused 156 errored scenarios: if create failed after delete
  succeeded, the ConfigMap was gone and every subsequent update also failed.

  Also print stdout/stderr from e2e-ops on failure so the actual oc error
  is visible in test logs.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix verify_connectivity for auth-enabled Prow environments

  On Prow, both /readiness and /v1/models return 401 when auth is enabled.
  The previous fix only accepted 200 from /v1/models, causing connectivity
  checks to always fail and port-forward to be declared dead. Accept 401
  as valid: it proves the full app stack is running, not just the socket.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix Llama Stack disruption cascade and pipeline port-forward coordination

  Llama Stack disruption tests left the pod dead after the feature because
  Behave clears custom context attributes between scenarios, so
  after_feature never saw llama_stack_was_running=True. This caused 59+
  subsequent scenarios to cascade-fail with Connection refused.

  Three fixes:
  - Store was_running in module-level state (survives Behave context
    resets) so after_feature reliably triggers _restore_llama_stack
  - Add restart-lightspeed fallback in before_scenario when port-forward
    alone fails (recovers from dead pods, not just dead tunnels)
  - Align pipeline.sh with pipeline-konflux.sh: export PID file paths for
    e2e-ops.sh, start Llama Stack port-forward on :8321, and use lsof/fuser
    fallback for port cleanup in minimal images

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Skip TLS and proxy e2e tests in Prow (no Docker Compose services)

  TLS and proxy features depend on mock-tls-inference and proxy sidecars
  that are only deployed via Docker Compose, not in the OpenShift cluster.
  Every TLS scenario burned 200s waiting for a provider that never exists,
  consuming ~63 min of the 4h Prow timeout for guaranteed failures.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix after_feature AttributeError on hostname_llama

  Behave clears custom context attributes between scenarios, so
  hostname_llama/port_llama are gone by the time after_feature runs.
  Store them in module-level state (same pattern as llama_stack_was_running).

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Radovan Fuchs <rfuchs@rfuchs-thinkpadp1gen7.tpb.csb>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

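The module-level state pattern named in the commit message can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `FakeContext`, `record_llama_stack_running`, and `should_restore_llama_stack` are hypothetical names, and `FakeContext.reset()` only imitates the attribute clearing that the message attributes to Behave.

```python
# Module-level dict: ordinary module globals are untouched when a test
# framework resets the attributes stored on its context object.
_FEATURE_STATE = {}


class FakeContext:
    """Hypothetical stand-in for a test context whose attributes get cleared."""

    def reset(self):
        # Imitates the framework dropping custom attributes between scenarios.
        self.__dict__.clear()


def record_llama_stack_running(context):
    """Called from a disruption step: remember the service was up."""
    context.llama_stack_was_running = True  # lost on reset
    _FEATURE_STATE["llama_stack_was_running"] = True  # survives reset


def should_restore_llama_stack():
    """Called from after_feature: read and clear the module-level flag."""
    return _FEATURE_STATE.pop("llama_stack_was_running", False)


if __name__ == "__main__":
    ctx = FakeContext()
    record_llama_stack_running(ctx)
    ctx.reset()
    print(hasattr(ctx, "llama_stack_was_running"))
    print(should_restore_llama_stack())
```

Popping the flag on read keeps one feature's state from leaking into the next feature's cleanup.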
1 parent 93b2bb5 commit 154f2f1

16 files changed

Lines changed: 641 additions & 152 deletions

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml

Lines changed: 2 additions & 1 deletion
@@ -22,13 +22,14 @@ spec:
             secretKeyRef:
               name: llama-stack-ip-secret
               key: key
-        # Same vars as docker-compose / server-mode YAML (${env.FAISS_VECTOR_STORE_ID} in byok_rag).
         - name: FAISS_VECTOR_STORE_ID
           valueFrom:
             secretKeyRef:
               name: faiss-vector-store-secret
               key: id
               optional: true
+        - name: KV_RAG_PATH
+          value: "/app-root/src/.llama/storage/rag/kv_store.db"
       image: ${LIGHTSPEED_STACK_IMAGE}
       ports:
         - containerPort: 8080
Lines changed: 205 additions & 0 deletions
@@ -0,0 +1,205 @@
+# Llama Stack pod for Prow: uses pre-built image with enrichment + RAG restore.
+#
+# Requires: ConfigMap llama-stack-config (run.yaml), ConfigMap rag-data (kv_store.db.gz),
+# ConfigMap lightspeed-stack-config (lightspeed-stack.yaml).
+# Requires: Image built as ${LLAMA_STACK_IMAGE} (set by pipeline.sh).
+#
+apiVersion: v1
+kind: Pod
+metadata:
+  name: llama-stack-service
+  labels:
+    pod: llama-stack-service
+spec:
+  securityContext:
+    seccompProfile:
+      type: RuntimeDefault
+  initContainers:
+    - name: setup-rag-data
+      image: busybox:latest
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
+        runAsNonRoot: true
+        runAsUser: 65534
+        seccompProfile:
+          type: RuntimeDefault
+      command:
+        - /bin/sh
+        - -c
+        - |
+          set -e
+          mkdir -p /data/src/.llama/storage/rag /data/src/.llama/storage/files /data/.e2e-rag-seed
+          if [ ! -f /rag-data/kv_store.db.gz ]; then
+            echo "FATAL: missing /rag-data/kv_store.db.gz"
+            ls -la /rag-data || true
+            exit 1
+          fi
+          gunzip -c /rag-data/kv_store.db.gz > /data/.e2e-rag-seed/kv_store.db
+          cp -f /data/.e2e-rag-seed/kv_store.db /data/src/.llama/storage/rag/kv_store.db
+          chmod -R 777 /data/src /data/.e2e-rag-seed
+          echo "RAG data extracted successfully"
+      volumeMounts:
+        - name: rag-storage
+          mountPath: /data
+        - name: rag-data
+          mountPath: /rag-data
+    - name: materialize-run-yaml
+      image: busybox:latest
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
+        runAsNonRoot: true
+        runAsUser: 65534
+        seccompProfile:
+          type: RuntimeDefault
+      command:
+        - /bin/sh
+        - -c
+        - |
+          set -e
+          cp /cm/run.yaml /work/run.yaml
+          chmod 664 /work/run.yaml
+      volumeMounts:
+        - name: config-cm
+          mountPath: /cm
+          readOnly: true
+        - name: rag-storage
+          mountPath: /work
+  containers:
+    - name: llama-stack-container
+      image: ${LLAMA_STACK_IMAGE}
+      securityContext:
+        allowPrivilegeEscalation: false
+        capabilities:
+          drop: ["ALL"]
+        runAsNonRoot: true
+        runAsUser: 1001
+        seccompProfile:
+          type: RuntimeDefault
+      workingDir: /opt/app-root
+      env:
+        - name: PYTHONPATH
+          value: "/opt/app-root/src"
+        - name: HOME
+          value: "/opt/app-root/src"
+        - name: KV_STORE_PATH
+          value: "/opt/app-root/src/.llama/storage/kv_store.db"
+        - name: KV_RAG_PATH
+          value: "/opt/app-root/src/.llama/storage/rag/kv_store.db"
+        - name: SQL_STORE_PATH
+          value: "/opt/app-root/src/.llama/storage/sql_store.db"
+        - name: KSVC_URL
+          valueFrom:
+            secretKeyRef:
+              name: api-url-secret
+              key: key
+        - name: VLLM_API_KEY
+          valueFrom:
+            secretKeyRef:
+              name: vllm-api-key-secret
+              key: key
+        - name: INFERENCE_MODEL
+          value: "meta-llama/Llama-3.1-8B-Instruct"
+        - name: OPENAI_API_KEY
+          valueFrom:
+            secretKeyRef:
+              name: openai-api-key-secret
+              key: key
+              optional: true
+        - name: E2E_OPENAI_MODEL
+          value: "gpt-4o-mini"
+        - name: LLAMA_STACK_CONFIG
+          value: "/opt/app-root/src/.llama/storage/run.yaml"
+        - name: FAISS_VECTOR_STORE_ID
+          valueFrom:
+            secretKeyRef:
+              name: faiss-vector-store-secret
+              key: id
+        - name: E2E_LLAMA_HOSTNAME
+          valueFrom:
+            secretKeyRef:
+              name: llama-stack-ip-secret
+              key: key
+      command:
+        - /bin/bash
+        - -c
+        - |
+          set -e
+          RAG_SEED="/opt/app-root/src/.llama/storage/.e2e-rag-seed/kv_store.db"
+          RAG_CM_GZ="/opt/app-root/rag-data-cm/kv_store.db.gz"
+          RAG_WORK="${KV_RAG_PATH:-/opt/app-root/src/.llama/storage/rag/kv_store.db}"
+          restore_rag_seed() {
+            mkdir -p "$(dirname "$RAG_WORK")"
+            if [[ -f "$RAG_CM_GZ" ]]; then
+              RAG_WORK="$RAG_WORK" RAG_CM_GZ="$RAG_CM_GZ" python3 -c 'import gzip, os, shutil, sys; r, g = os.environ["RAG_WORK"], os.environ["RAG_CM_GZ"]; t = r + ".tmp"; i = gzip.open(g, "rb"); o = open(t, "wb"); shutil.copyfileobj(i, o); i.close(); o.close(); sz = os.path.getsize(t); (sz >= 1048576) or (print("FATAL: RAG from ConfigMap too small:", sz, file=sys.stderr) or sys.exit(1)); os.replace(t, r); os.chmod(r, 0o664)' || exit 1
+            elif [[ -f "$RAG_SEED" ]]; then
+              cp -f "$RAG_SEED" "$RAG_WORK"
+              chmod 664 "$RAG_WORK" 2>/dev/null || true
+            fi
+          }
+          restore_rag_seed
+          INPUT_CONFIG="${LLAMA_STACK_CONFIG:-/opt/app-root/run.yaml}"
+          ENRICHED_CONFIG="/opt/app-root/run.yaml"
+          LIGHTSPEED_CONFIG="${LIGHTSPEED_CONFIG:-/opt/app-root/lightspeed-stack.yaml}"
+          ENV_FILE="/opt/app-root/.env"
+          if [[ -f "$LIGHTSPEED_CONFIG" ]]; then
+            echo "Enriching llama-stack config..."
+            ENRICHMENT_FAILED=0
+            python3 /opt/app-root/src/llama_stack_configuration.py \
+              -c "$LIGHTSPEED_CONFIG" \
+              -i "$INPUT_CONFIG" \
+              -o "$ENRICHED_CONFIG" \
+              -e "$ENV_FILE" 2>&1 || ENRICHMENT_FAILED=1
+            if [[ -f "$ENV_FILE" ]]; then
+              set -a && . "$ENV_FILE" && set +a
+            fi
+            if [[ -f "$ENRICHED_CONFIG" ]] && [[ "$ENRICHMENT_FAILED" -eq 0 ]]; then
+              echo "Using enriched config: $ENRICHED_CONFIG"
+              restore_rag_seed
+              exec llama stack run "$ENRICHED_CONFIG"
+            fi
+          fi
+          echo "Using original config: $INPUT_CONFIG"
+          restore_rag_seed
+          exec llama stack run "$INPUT_CONFIG"
+      ports:
+        - containerPort: 8321
+      readinessProbe:
+        httpGet:
+          path: /v1/health
+          port: 8321
+        initialDelaySeconds: 20
+        periodSeconds: 5
+        failureThreshold: 36
+      livenessProbe:
+        httpGet:
+          path: /v1/health
+          port: 8321
+        initialDelaySeconds: 120
+        periodSeconds: 20
+        failureThreshold: 3
+      volumeMounts:
+        - name: rag-storage
+          mountPath: /opt/app-root/src/.llama/storage
+        - name: lightspeed-config
+          mountPath: /opt/app-root/lightspeed-stack.yaml
+          subPath: lightspeed-stack.yaml
+          readOnly: true
+        - name: rag-data
+          mountPath: /opt/app-root/rag-data-cm
+          readOnly: true
+  volumes:
+    - name: rag-storage
+      emptyDir: {}
+    - name: config-cm
+      configMap:
+        name: llama-stack-config
+    - name: lightspeed-config
+      configMap:
+        name: lightspeed-stack-config
+    - name: rag-data
+      configMap:
+        name: rag-data
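The `python3 -c` one-liner inside `restore_rag_seed` is dense, so here is a readable sketch of the same steps: decompress to a temporary file, reject a result below a size floor, then atomically replace the target. The function name `restore_rag_db` and the `min_bytes` parameter are assumptions made for illustration; the default of 1048576 mirrors the threshold in the manifest. Unlike the one-liner, this sketch also deletes the temporary file when the size check fails.

```python
import gzip
import os
import shutil


def restore_rag_db(gz_path, dest_path, min_bytes=1048576):
    """Decompress gz_path to dest_path, replacing it only if large enough."""
    tmp_path = dest_path + ".tmp"
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)

    # Write to a temp file first so a partial decompress never clobbers dest.
    with gzip.open(gz_path, "rb") as src, open(tmp_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

    size = os.path.getsize(tmp_path)
    if size < min_bytes:
        os.remove(tmp_path)  # extra cleanup, not present in the one-liner
        raise ValueError(f"RAG from ConfigMap too small: {size}")

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, dest_path)
    os.chmod(dest_path, 0o664)
    return size
```

Writing to `dest_path + ".tmp"` keeps the temporary file in the same directory as the target, which is what allows the final rename to be atomic.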

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack.yaml

Lines changed: 0 additions & 62 deletions
This file was deleted.

tests/e2e-prow/rhoai/manifests/operators/ds-cluster.yaml

Lines changed: 0 additions & 1 deletion
@@ -2,7 +2,6 @@ apiVersion: datasciencecluster.opendatahub.io/v1
 kind: DataScienceCluster
 metadata:
   name: default-dsc
-  namespace: e2e-rhoai-dsc
 spec:
   serviceMesh:
     managementState: Managed

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-cpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ spec:
         - --port
         - "8080"
         - --max-model-len
-        - "2048"
+        - "32768"
       image: quay.io/rh-ee-cpompeia/vllm-cpu:latest
       name: kserve-container
       env:

tests/e2e-prow/rhoai/manifests/vllm/vllm-runtime-gpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -24,7 +24,7 @@ spec:
         - --port
         - "8080"
         - --max-model-len
-        - "2048"
+        - "32768"
         - --gpu-memory-utilization
         - "0.9"
       image: ${VLLM_IMAGE}
Lines changed: 17 additions & 14 deletions
@@ -1,27 +1,30 @@
 #!/bin/bash
 
 BASE_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+NAMESPACE="${NAMESPACE:-e2e-rhoai-dsc}"
 
-# Deploy llama-stack
-envsubst < "$BASE_DIR/manifests/lightspeed/llama-stack.yaml" | oc apply -f -
+# Create llama-stack-ip-secret before deploying the pod (it references the secret as an env var)
+export E2E_LLAMA_HOSTNAME="llama-stack-service-svc.${NAMESPACE}.svc.cluster.local"
+oc create secret generic llama-stack-ip-secret \
+    --from-literal=key="$E2E_LLAMA_HOSTNAME" \
+    -n "$NAMESPACE" 2>/dev/null || echo "Secret llama-stack-ip-secret exists"
+
+# Deploy llama-stack (substitute only LLAMA_STACK_IMAGE, leave other ${} intact)
+envsubst '${LLAMA_STACK_IMAGE}' < "$BASE_DIR/manifests/lightspeed/llama-stack-prow.yaml" | oc apply -n "$NAMESPACE" -f -
 
 oc wait pod/llama-stack-service \
-    -n e2e-rhoai-dsc --for=condition=Ready --timeout=600s
+    -n "$NAMESPACE" --for=condition=Ready --timeout=600s
 
-# Get url address of llama-stack pod
-oc label pod llama-stack-service pod=llama-stack-service -n e2e-rhoai-dsc
+# Expose llama-stack service
+oc label pod llama-stack-service pod=llama-stack-service -n "$NAMESPACE"
 
 oc expose pod llama-stack-service \
     --name=llama-stack-service-svc \
     --port=8321 \
     --type=ClusterIP \
-    -n e2e-rhoai-dsc
-
-export E2E_LLAMA_HOSTNAME="llama-stack-service-svc.e2e-rhoai-dsc.svc.cluster.local"
-
-oc create secret generic llama-stack-ip-secret \
-    --from-literal=key="$E2E_LLAMA_HOSTNAME" \
-    -n e2e-rhoai-dsc || echo "Secret exists"
+    -n "$NAMESPACE"
 
-# Deploy lightspeed-stack
-oc apply -f "$BASE_DIR/manifests/lightspeed/lightspeed-stack.yaml"
+# Deploy lightspeed-stack (substitute only LIGHTSPEED_STACK_IMAGE, leave other ${} intact)
+LIGHTSPEED_STACK_IMAGE="${LIGHTSPEED_STACK_IMAGE:-quay.io/lightspeed-core/lightspeed-stack:dev-latest}"
+export LIGHTSPEED_STACK_IMAGE
+envsubst '${LIGHTSPEED_STACK_IMAGE}' < "$BASE_DIR/manifests/lightspeed/lightspeed-stack.yaml" | oc apply -n "$NAMESPACE" -f -
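Why scoping the substitution matters: the manifest embeds a bash script that contains its own `${...}` and `$VAR` references, and an unscoped pass would replace any that are unset in the pipeline's environment with empty strings. The sketch below imitates the scoped behaviour in Python so the effect is easy to see. `substitute_only` is a hypothetical helper, not part of this repository, and it handles only the braced `${NAME}` form, whereas `envsubst` also accepts `$NAME`.

```python
import re

# Matches only the plain braced form ${NAME}; ${NAME:-default} does not match
# because the name must be followed immediately by the closing brace.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def substitute_only(text, values):
    """Replace ${NAME} for names present in values; leave everything else."""

    def replace(match):
        name = match.group(1)
        # Unknown names are returned verbatim instead of being blanked.
        return values[name] if name in values else match.group(0)

    return _VAR.sub(replace, text)


if __name__ == "__main__":
    manifest = (
        "image: ${LLAMA_STACK_IMAGE}\n"
        'RAG_WORK="${KV_RAG_PATH:-/opt/app-root/rag.db}"\n'
        'mkdir -p "$(dirname "$RAG_WORK")"\n'
    )
    # The image value here is a made-up placeholder.
    print(substitute_only(manifest, {"LLAMA_STACK_IMAGE": "registry.example/llama:dev"}))
```

Only the image line changes; the shell default expansion and the `$RAG_WORK` reference pass through untouched, which is the property the scoped `envsubst` calls in the script rely on.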
