LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI #2028
are-ces wants to merge 3 commits into
Walkthrough
Adds a new Tekton pipeline (
Changes
RHEL AI Konflux Integration Pipeline
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml:
- Around line 395-399: The lightspeed-stack repo/revision selection in the
Tekton test script is being overwritten by a hardcoded fork and branch, so the
pipeline ignores the SNAPSHOT-derived values. Remove the unconditional REPO_URL
and REPO_REV reassignment in the lightspeed-stack test step and keep using the
values parsed from SNAPSHOT in that block, leaving any temporary override behind
the existing TODO only if it is explicitly gated for local use. Reference the
REPO_URL and REPO_REV assignments in the lightspeed-stack pipeline step when
updating this logic.
- Around line 112-171: In the spot provisioning path of the Tekton step, the
exit status of `mapt aws rhel-ai create` is not checked, so failures can fall
through and emit empty results. Update the spot branch in the shell block to
guard the `mapt aws rhel-ai create` call the same way the on-demand path uses
`CREATED`, and fail fast with a clear error if creation does not succeed. Keep
the fix localized around the existing `if [[ "$(params.spot)" == "true" ]]`
branch and the subsequent result-writing commands so `host` and `vllm-api-key`
are only written after a successful create.
- Around line 348-349: Remove the onError: continue setting from the
run-e2e-tests task so failures are not masked when PIPELINE_EXIT is non-zero.
Update the task definition in lightspeed-stack-rhelai-test.yaml for
run-e2e-tests, and keep destroy-rhelai in finally as the cleanup path so the
pipeline correctly fails on e2e errors.
In `@tests/e2e-prow/rhoai/pipeline-konflux.sh`:
- Line 54: The OPENAI_API_KEY check in pipeline-konflux.sh is incorrectly tied
to log()’s return value, so `QUIET=1` can trigger the failure path even when the
key exists. Update the validation near the OPENAI_API_KEY guard to use an
explicit conditional instead of `&& ... || ...`, and keep the existence check
separate from the `log` side effect so `log()` cannot influence the exit
behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: aa3c3a80-cd0a-4ff3-8b89-20dfd359c0dd
📒 Files selected for processing (7)
- .tekton/integration-tests/README.md
- .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
- tests/e2e-prow/rhoai/pipeline-konflux.sh
- tests/e2e/configs/run-rhelai.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml
📜 Review details
⏰ Context from checks skipped due to timeout. (12)
- GitHub Check: build-pr
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
- GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
- GitHub Check: E2E: server mode / ci / group 1
- GitHub Check: E2E: library mode / ci / group 2
- GitHub Check: E2E: library mode / ci / group 1
- GitHub Check: E2E: server mode / ci / group 2
- GitHub Check: E2E: server mode / ci / group 3
- GitHub Check: E2E: library mode / ci / group 3
- GitHub Check: E2E Tests for Lightspeed Evaluation job
- GitHub Check: integration_tests (3.12)
- GitHub Check: integration_tests (3.13)
⚠️ CI failures not shown inline (4)
GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI
Conclusion: failure
##[group]Run set -euo pipefail
set -euo pipefail
uv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json
if ! diff -u docs/openapi.json /tmp/openapi-generated.json; then
  echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"
GitHub Actions: OpenAPI (Spectral) / spectral: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI
Conclusion: failure
GitHub Actions: Unit tests / 1_unit_tests (3.13).txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI
Conclusion: failure
##[group]Run uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
shell: /usr/bin/bash -e {0}
env:
UV_PYTHON: 3.13
VIRTUAL_ENV: /home/runner/work/lightspeed-stack/lightspeed-stack/.venv
UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
##[endgroup]
Uninstalled 1 package in 3ms
Installed 1 package in 3ms
============================= test session starts ==============================
platform linux -- Python 3.13.14, pytest-9.1.1, pluggy-1.6.0
benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /home/runner/work/lightspeed-stack/lightspeed-stack
configfile: pyproject.toml
plugins: asyncio-1.4.0, benchmark-5.2.3, anyio-4.14.1, order-1.5.0, mock-3.15.1, cov-7.1.0, logfire-4.37.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2925 items
tests/unit/a2a_storage/test_in_memory_context_store.py ........ [ 0%]
tests/unit/a2a_storage/test_sqlite_context_store.py .......... [ 0%]
tests/unit/a2a_storage/test_storage_factory.py ........... [ 0%]
tests/unit/app/endpoints/test_a2a.py .............................. [ 2%]
tests/unit/app/endpoints/test_authorized.py ... [ 2%]
tests/unit/app/endpoints/test_config.py .. [ 2%]
tests/unit/app/endpoints/test_conversations.py ......................... [ 3%]
................. [ 3%]
tests/unit/app/endpoints/test_conversations_v2.py ...................... [ 4%]
............... [ 4%]
tests/unit/app/endpoints/test_feedback.py ....................... [ 5%]
tests/unit/ap...
GitHub Actions: Unit tests / unit_tests (3.13): LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI
Conclusion: failure
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.
Applied to files:
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
📚 Learning: 2026-05-20T08:09:30.641Z
Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1580
File: docs/design/llama-stack-config-merge/poc-results/library-mode/synthesized-run.yaml:107-110
Timestamp: 2026-05-20T08:09:30.641Z
Learning: In Llama-stack config YAMLs, when defining a Llama Guard safety shield entry, set `provider_shield_id` to the *guard model identifier* (e.g., `meta-llama/Llama-Guard-3-8B`). Do not use a chat/generative model id (e.g., `openai/gpt-4o-mini`): a chat-model id (or `native_override`) indicates only an override landed and does **not** mean the safety shield is actually gating queries. Ensure any E2E coverage for the related implementation (JIRA/E2E tests) exercises a real Llama Guard model to verify that the shield is effective.
Applied to files:
- tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
- tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
- tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml
- tests/e2e/configs/run-rhelai.yaml
🪛 markdownlint-cli2 (0.22.1)
.tekton/integration-tests/README.md
[warning] 38-38: Files should end with a single newline character
(MD047, single-trailing-newline)
🔇 Additional comments (6)
tests/e2e/configs/run-rhelai.yaml (1)
24-32: LGTM!
tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml (1)
21-35: LGTM!
tests/e2e-prow/rhoai/pipeline-konflux.sh (1)
75-79: LGTM! Also applies to: 385-391
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)
33-38: LGTM!
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml (1)
146-166: LGTM!
.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml (1)
90-91: 🩺 Stability & Availability
`#!/bin/sh` is fine for this image. `ghcr.io/redhat-developer/mapt:pr-848` is based on UBI 9, so the shell supports the `[[ ... ]]` and `pipefail` usage here.
> Likely an incorrect or invalid review comment.
if [[ "$(params.spot)" == "true" ]]; then
  export AWS_DEFAULT_REGION="us-east-1"
  echo "[mapt] Using spot instances (searching all regions)..."
  mapt aws rhel-ai create \
    --project-name "mapt-rhel-ai-${RUN_ID}" \
    --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
    --conn-details-output /opt/host-info \
    --compute-sizes "$(params.instance-type)" \
    --version "$(params.rhelai-version)" \
    ${SPOT_ARGS} \
    --auto-start \
    --model "$(params.model)" \
    --hf-token "${HF_TOKEN}" \
    --api-key "${VLLM_API_KEY}" \
    --expose-ports 8000 \
    --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
    --tags "project=lightspeed-core,environment=konflux-ci"
else
  REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
  TIMEOUT=600
  CREATED=0

  for REGION in $REGIONS; do
    echo "[mapt] Trying on-demand in ${REGION}..."
    export AWS_DEFAULT_REGION="$REGION"

    if timeout $TIMEOUT mapt aws rhel-ai create \
      --project-name "mapt-rhel-ai-${RUN_ID}" \
      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
      --conn-details-output /opt/host-info \
      --compute-sizes "$(params.instance-type)" \
      --version "$(params.rhelai-version)" \
      --auto-start \
      --model "$(params.model)" \
      --hf-token "${HF_TOKEN}" \
      --api-key "${VLLM_API_KEY}" \
      --expose-ports 8000 \
      --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
      --tags "project=lightspeed-core,environment=konflux-ci"; then
      CREATED=1
      break
    fi

    echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
    mapt aws rhel-ai destroy \
      --project-name "mapt-rhel-ai-${RUN_ID}" \
      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
      --force-destroy 2>/dev/null || true
  done

  if [ "$CREATED" -ne 1 ]; then
    echo "[mapt] ERROR: Failed to create instance in any region"
    exit 1
  fi
fi

echo "[mapt] Instance created and vLLM started."
echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
echo -n "$(cat /opt/host-info/username)" > $(results.username.path)
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Spot provisioning failures are not detected.
The script runs with set -uo pipefail (no -e). In the spot branch the exit status of mapt aws rhel-ai create is never checked, unlike the on-demand branch which uses the CREATED guard. If spot creation fails, execution falls through to Lines 168–171, where cat /opt/host-info/host fails (ignored, no -e) and empty host/vllm-api-key results are emitted, causing the e2e task to run against a non-existent endpoint instead of failing fast.
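To make the failure mode concrete, here is a minimal, self-contained sketch. `create_instance` is a hypothetical stand-in for `mapt aws rhel-ai create` that always fails, and `provision_unguarded` / `provision_guarded` are illustrative names, not functions from the pipeline. `set +e` only mirrors the absence of `-e` in the step.

```shell
#!/bin/sh
set +e  # mirrors the step, which does not enable errexit

# Hypothetical stand-in for `mapt aws rhel-ai create`; always fails here.
create_instance() {
  return 1
}

# Unguarded: the failure is ignored and the "results" line is still reached.
provision_unguarded() (
  create_instance
  echo "results written"
)

# Guarded: stop with a clear error before any result is written.
provision_guarded() (
  create_instance || { echo "ERROR: creation failed" >&2; exit 1; }
  echo "results written"
)

provision_unguarded
echo "unguarded status: $?"
provision_guarded || echo "guarded status: $?"
```

Both function bodies run in a subshell so that `exit 1` ends only the function call, which keeps the sketch safe to source; the real step would exit the whole script.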
🐛 Proposed fix (spot branch)
mapt aws rhel-ai create \
...
- --tags "project=lightspeed-core,environment=konflux-ci"
+ --tags "project=lightspeed-core,environment=konflux-ci" || {
+ echo "[mapt] ERROR: spot creation failed"; exit 1; }
📝 Committable suggestion
if [[ "$(params.spot)" == "true" ]]; then
  export AWS_DEFAULT_REGION="us-east-1"
  echo "[mapt] Using spot instances (searching all regions)..."
  mapt aws rhel-ai create \
    --project-name "mapt-rhel-ai-${RUN_ID}" \
    --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
    --conn-details-output /opt/host-info \
    --compute-sizes "$(params.instance-type)" \
    --version "$(params.rhelai-version)" \
    ${SPOT_ARGS} \
    --auto-start \
    --model "$(params.model)" \
    --hf-token "${HF_TOKEN}" \
    --api-key "${VLLM_API_KEY}" \
    --expose-ports 8000 \
    --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
    --tags "project=lightspeed-core,environment=konflux-ci" || {
    echo "[mapt] ERROR: spot creation failed"; exit 1; }
else
  REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
  TIMEOUT=600
  CREATED=0
  for REGION in $REGIONS; do
    echo "[mapt] Trying on-demand in ${REGION}..."
    export AWS_DEFAULT_REGION="$REGION"
    if timeout $TIMEOUT mapt aws rhel-ai create \
      --project-name "mapt-rhel-ai-${RUN_ID}" \
      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
      --conn-details-output /opt/host-info \
      --compute-sizes "$(params.instance-type)" \
      --version "$(params.rhelai-version)" \
      --auto-start \
      --model "$(params.model)" \
      --hf-token "${HF_TOKEN}" \
      --api-key "${VLLM_API_KEY}" \
      --expose-ports 8000 \
      --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
      --tags "project=lightspeed-core,environment=konflux-ci"; then
      CREATED=1
      break
    fi
    echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
    mapt aws rhel-ai destroy \
      --project-name "mapt-rhel-ai-${RUN_ID}" \
      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
      --force-destroy 2>/dev/null || true
  done
  if [ "$CREATED" -ne 1 ]; then
    echo "[mapt] ERROR: Failed to create instance in any region"
    exit 1
  fi
fi
echo "[mapt] Instance created and vLLM started."
echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
echo -n "$(cat /opt/host-info/username)" > $(results.username.path)
- name: run-e2e-tests
  onError: continue
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
git ls-files '.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml'
wc -l .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
sed -n '300,420p' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
rg -n 'run-e2e-tests|destroy-rhelai|finally|onError|exit \$PIPELINE_EXIT' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
Repository: lightspeed-core/lightspeed-stack
Length of output: 5730
🌐 Web query:
Tekton step onError continue semantics task pipeline run failure continue step exits non-zero
💡 Result:
In Tekton, the `onError` field allows you to control the behavior of a Task when a step exits with a non-zero exit code [1][2]. By default, if a step fails, the remaining steps in the Task are skipped and the TaskRun is marked as a failure [1][3]. When you set `onError: continue` for a step, the following semantics apply:
1. Step Execution: If the step fails (i.e., exits with a non-zero code), Tekton does not stop the TaskRun [1][2]. Instead, it proceeds to execute the subsequent steps in the Task [1][4].
2. Exit Code Visibility: Although the step failure is ignored for the purpose of halting the TaskRun, the original exit code is preserved [1][2]. You can access this exit code by reading the file specified by the path variable
Citations:
- 1: https://tekton.dev/docs/pipelines/tasks/
- 2: https://github.com/tektoncd/community/blob/main/teps/0040-ignore-step-errors.md
- 3: https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md
- 4: TEP-0040 implementation - specifying `onError` in a step, tektoncd/pipeline#4106
- 5: https://github.com/tektoncd/pipeline/blob/main/docs/developers/taskruns.md
- 6: https://github.com/tektoncd/pipeline/blob/release-v1.3.x/examples/v1/taskruns/ignore-step-error.yaml
- 7: https://tekton.dev/docs/pipelines/debug/
- 8: https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md
Remove `onError: continue` from `run-e2e-tests`
`onError: continue` keeps the task/pipeline green even when `exit $PIPELINE_EXIT` is non-zero. `destroy-rhelai` already runs in `finally`, so this flag only masks e2e failures.
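As a plain-shell analogy only (Tekton's `finally` is declared in the pipeline YAML, not like this), the sketch below separates cleanup from failure propagation. `run_with_cleanup` and `run_masked` are hypothetical names, and `false` stands in for a failing e2e run.

```shell
#!/bin/sh
# Cleanup always runs, yet the caller still sees the test command's status.
run_with_cleanup() {
  (
    trap 'echo "cleanup: destroying instance"' EXIT
    "$@"        # hypothetical test command
    exit $?     # propagate its exit status after the trap has run
  )
}

# Masking variant: the failure is swallowed, similar in spirit to
# letting a failed task be ignored.
run_masked() {
  "$@" || true
}

run_with_cleanup false || echo "run fails with status $?"
run_masked false && echo "masked: status 0 despite the failure"
```

The point of the analogy is that a dedicated cleanup path makes error masking unnecessary: cleanup still happens, and the non-zero status reaches the caller.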
REPO_URL=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.url // "https://github.com/lightspeed-core/lightspeed-stack.git"' <<< "$SNAPSHOT")
REPO_REV=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.revision // "main"' <<< "$SNAPSHOT")
# TODO: remove branch override once merged to main
REPO_URL="https://github.com/are-ces/lightspeed-stack.git"
REPO_REV="rhelai-konflux"
🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win
Hardcoded fork/branch override ignores SNAPSHOT and tests unintended code.
Lines 398–399 unconditionally reset REPO_URL/REPO_REV to a personal fork (are-ces/lightspeed-stack.git @ rhelai-konflux), discarding the values just derived from $SNAPSHOT (Lines 395–396). As written, the pipeline always tests the fork rather than the component under test, which must be reverted before merge per the existing TODO.
Would you like me to open an issue to track removing this override before merge?
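If a temporary override has to stay during development, one way to gate it is sketched below. The helper `select_repo` and the variables `LOCAL_REPO_OVERRIDE_URL` / `LOCAL_REPO_OVERRIDE_REV` are assumptions made for illustration; they do not exist in the pipeline. The helper takes the SNAPSHOT-derived values as arguments and only replaces them when the override is set explicitly.

```shell
#!/bin/sh
# Hypothetical helper: prints "URL REV". The SNAPSHOT-derived values win
# unless LOCAL_REPO_OVERRIDE_URL (assumed name) is explicitly set.
select_repo() {
  url="$1"
  rev="$2"
  if [ -n "${LOCAL_REPO_OVERRIDE_URL:-}" ]; then
    url="$LOCAL_REPO_OVERRIDE_URL"
    # Fall back to the derived revision when no override revision is given.
    rev="${LOCAL_REPO_OVERRIDE_REV:-$rev}"
  fi
  printf '%s %s\n' "$url" "$rev"
}

# In CI nothing is overridden, so the derived values pass through unchanged.
select_repo "https://github.com/lightspeed-core/lightspeed-stack.git" "main"
```

With this shape the default path always tests the component under test, and forgetting to remove the override cannot silently redirect CI to a fork.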
[[ -n "$QUAY_ROBOT_NAME" ]] && log "✅ QUAY_ROBOT_NAME is set" || { echo "❌ Missing QUAY_ROBOT_NAME"; exit 1; }
[[ -n "$QUAY_ROBOT_PASSWORD" ]] && log "✅ QUAY_ROBOT_PASSWORD is set" || { echo "❌ Missing QUAY_ROBOT_PASSWORD"; exit 1; }
[[ -n "$OPENAI_API_KEY" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
&& log || { exit 1 } can falsely fail when QUIET=1.
log() returns the status of [ "$QUIET" != "1" ], which is non-zero whenever QUIET=1. In that case the && log "…" arm returns non-zero and the || arm executes, printing ❌ Missing OPENAI_API_KEY and exiting 1 even though the key is set. Use an explicit if to decouple validation from the side-effecting log.
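The pitfall can be reproduced in isolation. In this sketch, `log` is an assumed reconstruction based on the description above (the real function is not shown in this review), and `KEY`, `check_fragile`, and `check_explicit` are illustrative names.

```shell
#!/bin/sh
QUIET=1
KEY="present"

# Assumed shape of log(): prints only when QUIET is not 1, so its exit
# status is non-zero whenever QUIET=1.
log() {
  [ "$QUIET" != "1" ] && echo "$*"
}

# Fragile: `A && B || C` runs C when B fails, even though A succeeded.
check_fragile() {
  [ -n "$KEY" ] && log "KEY is set" || { echo "Missing KEY"; return 1; }
}

# Explicit conditional: log's status no longer decides the outcome.
check_explicit() {
  if [ -n "$KEY" ]; then
    log "KEY is set"
    return 0
  else
    echo "Missing KEY"
    return 1
  fi
}

check_fragile || echo "fragile check failed with status $? although KEY is set"
check_explicit && echo "explicit check passed"
```

The explicit `return 0` exists only because this sketch wraps the check in a function; in the script itself the `if` form shown in the proposed fix is sufficient.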
🐛 Proposed fix
-[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
+if [[ -n "${OPENAI_API_KEY:-}" ]]; then
+ log "✅ OPENAI_API_KEY is set"
+else
+ echo "❌ Missing OPENAI_API_KEY"; exit 1
+fi
📝 Committable suggestion
if [[ -n "${OPENAI_API_KEY:-}" ]]; then
  log "✅ OPENAI_API_KEY is set"
else
  echo "❌ Missing OPENAI_API_KEY"; exit 1
fi
- Update run-rhelai.yaml: use base_url, VLLM_* env vars, restore comments
- Add lightspeed-stack-rhelai.yaml: LCS config with vllm provider
- Sync examples/vllm-rhelai.yaml with test config
- Parameterize pipeline-konflux.sh for LLAMA_STACK_CONFIG, LCS_CONFIG, VLLM_URL, VLLM_MODEL, VLLM_API_KEY
- Add optional VLLM env vars to pod manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provisions a RHEL AI GPU instance via MAPT and an ephemeral OpenShift cluster, deploys lightspeed-stack with vLLM as inference provider, and runs the full behave e2e test suite.
- OIDC federation for AWS auth (no static keys)
- On-demand with region fallback (spot available via param)
- Per-run S3 state isolation using PipelineRun name
- Random API key per run for vLLM authentication
- Tool calling via --vllm-extra-args
- RHEL AI 3.4.0 GA, Llama-3.1-8B-Instruct, 131072 context window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document MAPT usage, S3 state bucket, instance provisioning, GPU requirements, and AMI version management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
Add a Konflux Tekton pipeline for running the full e2e test suite against RHEL AI instances provisioned on AWS. The pipeline uses MAPT to provision a GPU instance with vLLM (RHAIIS) auto-started, then deploys and tests lightspeed-stack in the same way as the existing Konflux integration tests, but configured to use the vLLM server on the RHEL AI instance as its inference provider.
The pipeline provisions instances with 96GB+ total VRAM (4x GPU) because the e2e tests require a 131072-token context window: some test requests exceed 65K tokens and fail with a smaller context window. Single-GPU instances (24GB) cannot fit both the model weights and the required KV cache.
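A rough back-of-the-envelope check of the sizing argument, as a sketch only. The model shape (32 layers, 8 KV heads, head dimension 128), the roughly 8.03B parameter count, and 16-bit storage for weights and KV cache are assumptions taken from the public Llama-3.1-8B configuration, not from this PR. The estimate ignores activations and runtime overhead.

```shell
#!/bin/sh
# Assumed Llama-3.1-8B shape (public model config, not stated in this PR).
LAYERS=32
KV_HEADS=8
HEAD_DIM=128
BYTES_PER_VALUE=2      # 16-bit weights and KV cache (assumption)
CONTEXT=131072         # --max-model-len used by the pipeline
PARAMS=8030000000      # approximate parameter count (assumption)

# K and V each hold KV_HEADS * HEAD_DIM values per layer for every token.
kv_per_token=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE))
kv_total=$((kv_per_token * CONTEXT))
weights=$((PARAMS * BYTES_PER_VALUE))
gib=$((1024 * 1024 * 1024))

echo "KV cache per token: ${kv_per_token} bytes"
echo "KV cache for one full-length sequence: $((kv_total / gib)) GiB"
echo "Weights: about $((weights / gib)) GiB"
echo "Weights plus one full KV cache: about $(( (weights + kv_total) / gib )) GiB"
```

Under these assumptions a single full-length sequence needs 16 GiB of KV cache on top of roughly 15 GiB of weights, which already exceeds a 24GB GPU before any overhead is counted.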
Key features:
- `pipeline-konflux.sh` to support both OpenAI and vLLM inference providers

New/modified files:
- `tests/e2e/configs/run-rhelai.yaml`: Llama Stack config with `remote::vllm` provider
- `tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml`: LCS config with vllm as default provider
- `tests/e2e-prow/rhoai/pipeline-konflux.sh`: parameterized for VLLM_URL, VLLM_MODEL, VLLM_API_KEY
- `tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml`: optional vLLM env vars
- `tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml`: optional VLLM_MODEL env var
- `.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml`: full pipeline
- `.tekton/integration-tests/README.md`: documentation

Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
Summary by CodeRabbit
New Features
Bug Fixes