LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI by are-ces · Pull Request #2028 · lightspeed-core/lightspeed-stack

are-ces · 2026-06-30T09:17:29Z

Description

Add a Konflux Tekton pipeline for running the full e2e test suite against RHEL AI instances provisioned on AWS. The pipeline uses MAPT to provision a GPU instance with vLLM (RHAIIS) auto-started, then deploys and tests lightspeed-stack as in the existing Konflux integration tests but configured to use the RHEL AI vLLM as its inference provider.

The pipeline provisions instances with 96GB+ total VRAM (4x GPU) because the e2e tests require a 131072-token context window: some test requests exceed 65K tokens and fail with a smaller context window. Single-GPU instances (24GB) cannot fit both the model weights and the required KV cache.
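
A rough back-of-the-envelope check of the sizing claim above. This is a sketch only: it assumes the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128), 16-bit weights and KV cache, and a single sequence at full context. None of these numbers come from the PR itself, and real vLLM memory use also includes activations and allocator overhead.

```bash
#!/usr/bin/env bash
# Assumed model shape (Llama-3.1-8B) and 16-bit storage; not taken from the PR.
layers=32
kv_heads=8
head_dim=128
bytes_per_elem=2
context_tokens=131072
params=8000000000   # ~8B parameters, rounded

# K and V each store layers * kv_heads * head_dim elements per token.
kv_bytes_per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
kv_gib=$(( kv_bytes_per_token * context_tokens / 1073741824 ))
weights_gib=$(( params * bytes_per_elem / 1073741824 ))   # integer division rounds down
total_gib=$(( kv_gib + weights_gib ))

echo "KV cache per token: ${kv_bytes_per_token} bytes"
echo "KV cache for one ${context_tokens}-token sequence: ${kv_gib} GiB"
echo "Weights: ~${weights_gib} GiB, weights + KV cache: ~${total_gib} GiB"
```

Under these assumptions the weights plus one full-length KV cache already exceed 24 GB, which is consistent with the single-GPU limitation described above.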

Key features:

RHEL AI provisioning via MAPT with auto-start, tool calling, and configurable context window
Spot/on-demand toggle with multi-instance-type fallback (g5.12xlarge, g6.12xlarge, g5.24xlarge, g6.24xlarge)
On-demand mode retries across 6 AWS regions with 10-minute timeout per attempt
Per-run S3 state isolation using PipelineRun name (no concurrent run conflicts)
Random API key per run for vLLM authentication
Parameterized pipeline-konflux.sh to support both OpenAI and vLLM inference providers
Integration tests README documenting MAPT, S3 bucket, provisioning modes, and AMI versioning
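
As an illustration of the per-run isolation and random API key items above, a minimal sketch follows. The `s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}` layout matches the pipeline script quoted later in this thread; the helper names `state_url` and `random_api_key`, the placeholder bucket and run names, and the key-generation method are assumptions for illustration, not necessarily what the PR does.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: Pulumi state prefix unique to one PipelineRun.
state_url() {
  local bucket="$1" run_id="$2"
  printf 's3://%s/mapt/rhel-ai/%s' "$bucket" "$run_id"
}

# Hypothetical helper: 64 hex characters read from /dev/urandom.
random_api_key() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n'
}

RUN_ID="example-pipelinerun-abc12"   # placeholder for the PipelineRun name
BUCKET="example-bucket"              # placeholder bucket name
STATE_URL="$(state_url "$BUCKET" "$RUN_ID")"
VLLM_API_KEY="$(random_api_key)"

echo "state prefix: ${STATE_URL}"
echo "api key length: ${#VLLM_API_KEY}"
```

Because the prefix embeds the run name, two concurrent runs write to different state paths, and each run gets its own key.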

New/modified files:

tests/e2e/configs/run-rhelai.yaml — Llama Stack config with remote::vllm provider
tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml — LCS config with vllm as default provider
tests/e2e-prow/rhoai/pipeline-konflux.sh — parameterized for VLLM_URL, VLLM_MODEL, VLLM_API_KEY
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml — optional vLLM env vars
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml — optional VLLM_MODEL env var
.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml — full pipeline
.tekton/integration-tests/README.md — documentation

Type of change

Tools used to create PR

Assisted-by: Claude Opus 4.6
Generated by: N/A

Related Tickets & Documents

Related Issue: LCORE-1724

Checklist before requesting a review

I have performed a self-review of my code.
PR has passed all pre-merge test jobs.
If it is a core feature, I have added thorough tests.

Testing

Pipeline tested end-to-end in Konflux with RHEL AI 3.4.0 GA on g5.12xlarge
270/276 e2e scenarios pass (6 failures due to model behavior differences between Llama-3.1-8B and gpt-4o-mini)
Spot and on-demand provisioning validated locally and in Konflux
Per-run S3 isolation verified with concurrent pipeline runs

Summary by CodeRabbit

New Features
- Added a new Konflux integration test setup for running end-to-end checks against both OpenAI and RHEL AI/vLLM.
- Introduced support for configurable vLLM-based inference, including model, URL, and API key settings.
- Added a new server-mode configuration for Lightspeed Core Service with external llama-stack connectivity and RAG support.
Bug Fixes
- Made several secret and environment settings optional to better support different test environments.
- Improved test cleanup so temporary cloud resources are removed even if a run fails.

coderabbitai · 2026-06-30T09:18:25Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 730ed797-7a90-4aca-88e0-b2dcbb5afeea

Walkthrough

Adds a new Tekton pipeline (lightspeed-stack-rhelai-tests-pipeline) for RHEL AI vLLM Konflux integration tests. The pipeline provisions a GPU-backed AWS RHEL AI instance via MAPT, creates an ephemeral OpenShift cluster, and runs E2E tests. Supporting changes add vLLM environment wiring to pod manifests, update run-rhelai.yaml to use env-sourced vLLM config, introduce a new LCS server-mode config, update pipeline-konflux.sh for conditional vLLM secret/config handling, and add a README.

Changes

RHEL AI Konflux Integration Pipeline

Layer / File(s)	Summary
RHEL AI run config and LCS server-mode config `tests/e2e/configs/run-rhelai.yaml`, `tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml`	`run-rhelai.yaml` updated to use `env.VLLM_URL`, `env.VLLM_API_KEY`, and `env.VLLM_MODEL` for provider wiring and model registration; fixed `openai` allowed_models. New `lightspeed-stack-rhelai.yaml` adds a full LCS server-mode config with vLLM inference defaults and FAISS RAG.
Pipeline script and manifest vLLM env wiring `tests/e2e-prow/rhoai/pipeline-konflux.sh`, `tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml`, `tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml`	`pipeline-konflux.sh` conditionally creates vLLM Kubernetes secrets, uses configurable `LLAMA_STACK_CONFIG`/`LCS_CONFIG` paths for ConfigMaps, and selects `vllm` provider/model defaults when `VLLM_URL` is set. Pod manifests add optional `VLLM_URL`, `VLLM_API_KEY`, and `VLLM_MODEL` env vars sourced from secrets.
Tekton RHEL AI pipeline definition `.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml`	Defines `lightspeed-stack-rhelai-tests-pipeline` with `provision-rhelai` (spot/on-demand AWS instance, vLLM API key generation), `eaas-provision-space`, `provision-cluster`, `get-stack-images`, `rhelai-e2e-tests` (runs `pipeline-konflux.sh`), and a `finally` `destroy-rhelai` cleanup task.
Integration tests README `.tekton/integration-tests/README.md`	Documents available E2E pipelines, MAPT provisioning, S3 Pulumi state lifecycle/prefix isolation, spot vs on-demand behavior, default model assumptions, and AMI version selection.
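
The conditional provider selection summarized in the table can be sketched as below. Only the `VLLM_URL` variable and the `vllm`/`openai` provider names come from the summary; the helper name `select_provider` and the example URL are hypothetical, and the real `pipeline-konflux.sh` may structure this differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose the inference provider based on whether VLLM_URL is set.
select_provider() {
  if [[ -n "${VLLM_URL:-}" ]]; then
    echo "vllm"
  else
    echo "openai"
  fi
}

unset VLLM_URL
echo "without VLLM_URL: $(select_provider)"

VLLM_URL="http://203.0.113.10:8000/v1"   # placeholder address
echo "with VLLM_URL: $(select_provider)"
```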

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

lightspeed-core/lightspeed-stack#1449: Modifies the tool_runtime MCP configuration in tests/e2e/configs/run-rhelai.yaml, a file whose surrounding structure this PR also adjusts.
lightspeed-core/lightspeed-stack#1741: Modifies tests/e2e-prow/rhoai/pipeline-konflux.sh for Konflux E2E provider/model selection, the same script this PR extends with vLLM conditional logic.

Suggested labels

Review effort 2/5

Suggested reviewers

radofuchs
tisnik

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly reflects the main change: adding a reliable CI workflow for deploying RHEL AI instances.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml:
- Around line 395-399: The lightspeed-stack repo/revision selection in the
Tekton test script is being overwritten by a hardcoded fork and branch, so the
pipeline ignores the SNAPSHOT-derived values. Remove the unconditional REPO_URL
and REPO_REV reassignment in the lightspeed-stack test step and keep using the
values parsed from SNAPSHOT in that block, leaving any temporary override behind
the existing TODO only if it is explicitly gated for local use. Reference the
REPO_URL and REPO_REV assignments in the lightspeed-stack pipeline step when
updating this logic.
- Around line 112-171: In the spot provisioning path of the Tekton step, the
exit status of `mapt aws rhel-ai create` is not checked, so failures can fall
through and emit empty results. Update the spot branch in the shell block to
guard the `mapt aws rhel-ai create` call the same way the on-demand path uses
`CREATED`, and fail fast with a clear error if creation does not succeed. Keep
the fix localized around the existing `if [[ "$(params.spot)" == "true" ]]`
branch and the subsequent result-writing commands so `host` and `vllm-api-key`
are only written after a successful create.
- Around line 348-349: Remove the onError: continue setting from the
run-e2e-tests task so failures are not masked when PIPELINE_EXIT is non-zero.
Update the task definition in lightspeed-stack-rhelai-test.yaml for
run-e2e-tests, and keep destroy-rhelai in finally as the cleanup path so the
pipeline correctly fails on e2e errors.

In `@tests/e2e-prow/rhoai/pipeline-konflux.sh`:
- Line 54: The OPENAI_API_KEY check in pipeline-konflux.sh is incorrectly tied
to log()’s return value, so `QUIET=1` can trigger the failure path even when the
key exists. Update the validation near the OPENAI_API_KEY guard to use an
explicit conditional instead of `&& ... || ...`, and keep the existence check
separate from the `log` side effect so `log()` cannot influence the exit
behavior.
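
To make the `&& ... || ...` pitfall concrete, here is a small sketch. The `log()` body is an assumption (the real function is not shown in this thread); it is written so that it returns non-zero when `QUIET=1`, which is the behavior the comment describes.

```bash
#!/usr/bin/env bash
# Hypothetical log(): returns non-zero when QUIET=1 because the [[ ]] test fails.
log() { [[ "${QUIET:-0}" != "1" ]] && echo "$*"; }

# Fragile form: the || branch also runs when log() fails, not only when the key is missing.
check_fragile() {
  [[ -n "${OPENAI_API_KEY:-}" ]] && log "OPENAI_API_KEY is set" || { echo "missing key" >&2; return 1; }
}

# Explicit form: the existence check alone decides the outcome.
check_explicit() {
  if [[ -z "${OPENAI_API_KEY:-}" ]]; then
    echo "missing key" >&2
    return 1
  fi
  log "OPENAI_API_KEY is set" || true
}

OPENAI_API_KEY="dummy"
QUIET=1
fragile_rc=0;  check_fragile  || fragile_rc=$?
explicit_rc=0; check_explicit || explicit_rc=$?
echo "fragile=${fragile_rc} explicit=${explicit_rc}"
```

With the key set and `QUIET=1`, the fragile form reports a missing key while the explicit form succeeds.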

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa3c3a80-cd0a-4ff3-8b89-20dfd359c0dd

📥 Commits

Reviewing files that changed from the base of the PR and between 8efa018 and cb7ad02.

📒 Files selected for processing (7)

.tekton/integration-tests/README.md
.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
tests/e2e-prow/rhoai/pipeline-konflux.sh
tests/e2e/configs/run-rhelai.yaml
tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml

📜 Review details

⏰ Context from checks skipped due to timeout. (12)

GitHub Check: build-pr
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
GitHub Check: E2E: server mode / ci / group 1
GitHub Check: E2E: library mode / ci / group 2
GitHub Check: E2E: library mode / ci / group 1
GitHub Check: E2E: server mode / ci / group 2
GitHub Check: E2E: server mode / ci / group 3
GitHub Check: E2E: library mode / ci / group 3
GitHub Check: E2E Tests for Lightspeed Evaluation job
GitHub Check: integration_tests (3.12)
GitHub Check: integration_tests (3.13)

⚠️ CI failures not shown inline (4)

GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 set -euo pipefail
 uv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json
 if ! diff -u docs/openapi.json /tmp/openapi-generated.json; then
   echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"

GitHub Actions: OpenAPI (Spectral) / spectral: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 set -euo pipefail
 uv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json
 if ! diff -u docs/openapi.json /tmp/openapi-generated.json; then
   echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"

GitHub Actions: Unit tests / 1_unit_tests (3.13).txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 shell: /usr/bin/bash -e {0}
 env:
   UV_PYTHON: 3.13
   VIRTUAL_ENV: /home/runner/work/lightspeed-stack/lightspeed-stack/.venv
   UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
 ##[endgroup]
 Uninstalled 1 package in 3ms
 Installed 1 package in 3ms
 ============================= test session starts ==============================
 platform linux -- Python 3.13.14, pytest-9.1.1, pluggy-1.6.0
 benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
 rootdir: /home/runner/work/lightspeed-stack/lightspeed-stack
 configfile: pyproject.toml
 plugins: asyncio-1.4.0, benchmark-5.2.3, anyio-4.14.1, order-1.5.0, mock-3.15.1, cov-7.1.0, logfire-4.37.0
 asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
 collected 2925 items
 tests/unit/a2a_storage/test_in_memory_context_store.py ........          [  0%]
 tests/unit/a2a_storage/test_sqlite_context_store.py ..........           [  0%]
 tests/unit/a2a_storage/test_storage_factory.py ...........               [  0%]
 tests/unit/app/endpoints/test_a2a.py ..............................      [  2%]
 tests/unit/app/endpoints/test_authorized.py ...                          [  2%]
 tests/unit/app/endpoints/test_config.py ..                               [  2%]
 tests/unit/app/endpoints/test_conversations.py ......................... [  3%]
 .................                                                        [  3%]
 tests/unit/app/endpoints/test_conversations_v2.py ...................... [  4%]
 ...............                                                          [  4%]
 tests/unit/app/endpoints/test_feedback.py .......................        [  5%]
 tests/unit/ap...

GitHub Actions: Unit tests / unit_tests (3.13): LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 shell: /usr/bin/bash -e {0}
 env:
   UV_PYTHON: 3.13
   VIRTUAL_ENV: /home/runner/work/lightspeed-stack/lightspeed-stack/.venv
   UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
 ##[endgroup]
 Uninstalled 1 package in 3ms
 Installed 1 package in 3ms
 ============================= test session starts ==============================
 platform linux -- Python 3.13.14, pytest-9.1.1, pluggy-1.6.0
 benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
 rootdir: /home/runner/work/lightspeed-stack/lightspeed-stack
 configfile: pyproject.toml
 plugins: asyncio-1.4.0, benchmark-5.2.3, anyio-4.14.1, order-1.5.0, mock-3.15.1, cov-7.1.0, logfire-4.37.0
 asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
 collected 2925 items
 tests/unit/a2a_storage/test_in_memory_context_store.py ........          [  0%]
 tests/unit/a2a_storage/test_sqlite_context_store.py ..........           [  0%]
 tests/unit/a2a_storage/test_storage_factory.py ...........               [  0%]
 tests/unit/app/endpoints/test_a2a.py ..............................      [  2%]
 tests/unit/app/endpoints/test_authorized.py ...                          [  2%]
 tests/unit/app/endpoints/test_config.py ..                               [  2%]
 tests/unit/app/endpoints/test_conversations.py ......................... [  3%]
 .................                                                        [  3%]
 tests/unit/app/endpoints/test_conversations_v2.py ...................... [  4%]
 ...............                                                          [  4%]
 tests/unit/app/endpoints/test_feedback.py .......................        [  5%]
 tests/unit/ap...

🧰 Additional context used

🧠 Learnings (2)

📚 Learning: 2026-02-19T10:06:50.647Z

Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.

Applied to files:

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml

📚 Learning: 2026-05-20T08:09:30.641Z

Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1580
File: docs/design/llama-stack-config-merge/poc-results/library-mode/synthesized-run.yaml:107-110
Timestamp: 2026-05-20T08:09:30.641Z
Learning: In Llama-stack config YAMLs, when defining a Llama Guard safety shield entry, set `provider_shield_id` to the *guard model identifier* (e.g., `meta-llama/Llama-Guard-3-8B`). Do not use a chat/generative model id (e.g., `openai/gpt-4o-mini`): a chat-model id (or `native_override`) indicates only an override landed and does **not** mean the safety shield is actually gating queries. Ensure any E2E coverage for the related implementation (JIRA/E2E tests) exercises a real Llama Guard model to verify that the shield is effective.

Applied to files:

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml
tests/e2e/configs/run-rhelai.yaml

🪛 markdownlint-cli2 (0.22.1)

.tekton/integration-tests/README.md

[warning] 38-38: Files should end with a single newline character

(MD047, single-trailing-newline)

🔇 Additional comments (6)

tests/e2e/configs/run-rhelai.yaml (1)

24-32: LGTM!

tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml (1)

21-35: LGTM!

tests/e2e-prow/rhoai/pipeline-konflux.sh (1)

75-79: LGTM!

Also applies to: 385-391

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)

33-38: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml (1)

146-166: LGTM!

.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml (1)

90-91: 🩺 Stability & Availability

#!/bin/sh is fine for this image. ghcr.io/redhat-developer/mapt:pr-848 is based on UBI 9, so the shell supports the [[ ... ]] and pipefail usage here.

> Likely an incorrect or invalid review comment.

coderabbitai · 2026-06-30T09:34:13Z

+              if [[ "$(params.spot)" == "true" ]]; then
+                export AWS_DEFAULT_REGION="us-east-1"
+                echo "[mapt] Using spot instances (searching all regions)..."
+                mapt aws rhel-ai create \
+                    --project-name "mapt-rhel-ai-${RUN_ID}" \
+                    --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
+                    --conn-details-output /opt/host-info \
+                    --compute-sizes "$(params.instance-type)" \
+                    --version "$(params.rhelai-version)" \
+                    ${SPOT_ARGS} \
+                    --auto-start \
+                    --model "$(params.model)" \
+                    --hf-token "${HF_TOKEN}" \
+                    --api-key "${VLLM_API_KEY}" \
+                    --expose-ports 8000 \
+                    --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
+                    --tags "project=lightspeed-core,environment=konflux-ci"
+              else
+                REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
+                TIMEOUT=600
+                CREATED=0
+
+                for REGION in $REGIONS; do
+                  echo "[mapt] Trying on-demand in ${REGION}..."
+                  export AWS_DEFAULT_REGION="$REGION"
+
+                  if timeout $TIMEOUT mapt aws rhel-ai create \
+                      --project-name "mapt-rhel-ai-${RUN_ID}" \
+                      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
+                      --conn-details-output /opt/host-info \
+                      --compute-sizes "$(params.instance-type)" \
+                      --version "$(params.rhelai-version)" \
+                      --auto-start \
+                      --model "$(params.model)" \
+                      --hf-token "${HF_TOKEN}" \
+                      --api-key "${VLLM_API_KEY}" \
+                      --expose-ports 8000 \
+                      --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
+                      --tags "project=lightspeed-core,environment=konflux-ci"; then
+                    CREATED=1
+                    break
+                  fi
+
+                  echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
+                  mapt aws rhel-ai destroy \
+                      --project-name "mapt-rhel-ai-${RUN_ID}" \
+                      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
+                      --force-destroy 2>/dev/null || true
+                done
+
+                if [ "$CREATED" -ne 1 ]; then
+                  echo "[mapt] ERROR: Failed to create instance in any region"
+                  exit 1
+                fi
+              fi
+
+              echo "[mapt] Instance created and vLLM started."
+              echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
+              echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
+              echo -n "$(cat /opt/host-info/username)" > $(results.username.path)


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Spot provisioning failures are not detected.

The script runs with set -uo pipefail (no -e). In the spot branch the exit status of mapt aws rhel-ai create is never checked, unlike the on-demand branch which uses the CREATED guard. If spot creation fails, execution falls through to Lines 168–171, where cat /opt/host-info/host fails (ignored, no -e) and empty host/vllm-api-key results are emitted, causing the e2e task to run against a non-existent endpoint instead of failing fast.
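
The fall-through can be reproduced in isolation with a stand-in for the failing command. In this sketch `fake_create` is a hypothetical placeholder for a failed `mapt aws rhel-ai create`, and the temporary directory stands in for `/opt/host-info`; it only illustrates the shell behavior, not the actual pipeline step.

```bash
#!/usr/bin/env bash
set +e            # -e deliberately off, mirroring the step described above
set -uo pipefail

workdir="$(mktemp -d)"          # stand-in for /opt/host-info
fake_create() { return 1; }     # hypothetical stand-in for a failed create

# Unguarded: the failure is ignored and the "host" result ends up empty.
fake_create
unguarded_host="$(cat "$workdir/host" 2>/dev/null)"
echo "unguarded host: '${unguarded_host}'"

# Guarded: the failure is detected before any result would be written.
guarded_status=0
if ! fake_create; then
  echo "create failed; results would not be written" >&2
  guarded_status=1
fi

rm -rf "$workdir"
```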

🐛 Proposed fix (spot branch)

  mapt aws rhel-ai create \
      ...
-     --tags "project=lightspeed-core,environment=konflux-ci"
+     --tags "project=lightspeed-core,environment=konflux-ci" || {
+       echo "[mapt] ERROR: spot creation failed"; exit 1; }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if [[ "$(params.spot)" == "true" ]]; then
  export AWS_DEFAULT_REGION="us-east-1"
  echo "[mapt] Using spot instances (searching all regions)..."
  mapt aws rhel-ai create \
      --project-name "mapt-rhel-ai-${RUN_ID}" \
      --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
      --conn-details-output /opt/host-info \
      --compute-sizes "$(params.instance-type)" \
      --version "$(params.rhelai-version)" \
      ${SPOT_ARGS} \
      --auto-start \
      --model "$(params.model)" \
      --hf-token "${HF_TOKEN}" \
      --api-key "${VLLM_API_KEY}" \
      --expose-ports 8000 \
      --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
      --tags "project=lightspeed-core,environment=konflux-ci" || {
    echo "[mapt] ERROR: spot creation failed"; exit 1; }
else
  REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
  TIMEOUT=600
  CREATED=0

  for REGION in $REGIONS; do
    echo "[mapt] Trying on-demand in ${REGION}..."
    export AWS_DEFAULT_REGION="$REGION"

    if timeout $TIMEOUT mapt aws rhel-ai create \
        --project-name "mapt-rhel-ai-${RUN_ID}" \
        --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
        --conn-details-output /opt/host-info \
        --compute-sizes "$(params.instance-type)" \
        --version "$(params.rhelai-version)" \
        --auto-start \
        --model "$(params.model)" \
        --hf-token "${HF_TOKEN}" \
        --api-key "${VLLM_API_KEY}" \
        --expose-ports 8000 \
        --vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
        --tags "project=lightspeed-core,environment=konflux-ci"; then
      CREATED=1
      break
    fi

    echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
    mapt aws rhel-ai destroy \
        --project-name "mapt-rhel-ai-${RUN_ID}" \
        --backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
        --force-destroy 2>/dev/null || true
  done

  if [ "$CREATED" -ne 1 ]; then
    echo "[mapt] ERROR: Failed to create instance in any region"
    exit 1
  fi
fi

echo "[mapt] Instance created and vLLM started."
echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
echo -n "$(cat /opt/host-info/username)" > $(results.username.path)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around lines 112 - 171, In the spot provisioning path of the Tekton step, the exit status of `mapt aws rhel-ai create` is not checked, so failures can fall through and emit empty results. Update the spot branch in the shell block to guard the `mapt aws rhel-ai create` call the same way the on-demand path uses `CREATED`, and fail fast with a clear error if creation does not succeed. Keep the fix localized around the existing `if [[ "$(params.spot)" == "true" ]]` branch and the subsequent result-writing commands so `host` and `vllm-api-key` are only written after a successful create.

coderabbitai · 2026-06-30T09:34:13Z

+          - name: run-e2e-tests
+            onError: continue


🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
git ls-files '.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml'
wc -l .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
sed -n '300,420p' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
rg -n 'run-e2e-tests|destroy-rhelai|finally|onError|exit \$PIPELINE_EXIT' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml

Repository: lightspeed-core/lightspeed-stack

Length of output: 5730

🌐 Web query:

Tekton step onError continue semantics task pipeline run failure continue step exits non-zero

💡 Result:

In Tekton, the onError field controls the behavior of a Task when a step exits with a non-zero exit code [1][2]. By default, if a step fails, the remaining steps in the Task are skipped and the TaskRun is marked as a failure [1][3]. When you set onError: continue for a step, the following semantics apply:

1. Step Execution: If the step fails (i.e., exits with a non-zero code), Tekton does not stop the TaskRun [1][2]. Instead, it proceeds to execute the subsequent steps in the Task [1][4].
2. Exit Code Visibility: Although the step failure is ignored for the purpose of halting the TaskRun, the original exit code is preserved [1][2]. You can access this exit code by reading the file at the path given by $(steps.step-<step-name>.exitCode.path) or $(steps.step-unnamed-<step-index>.exitCode.path) [5][6].
3. Task Results: If a step fails but manages to initialize a result file before exiting, that result is made available to subsequent tasks [1][3]. However, if the step fails before creating a result that a downstream task expects, the pipeline may fail with an InvalidTaskResultReference error [3][2].
4. Precedence: The onError behavior is overridden if a breakpoint is active for that step [1][4]. Debugging features allow you to manually dictate whether a step should be treated as a success or failure, which takes precedence over the onError configuration [1][7].

This functionality is specific to the step level within a Task [1][8]. It is distinct from the onError semantics available at the Pipeline level, which control whether an entire PipelineTask failure stops the overall PipelineRun [8].

Citations:

1: https://tekton.dev/docs/pipelines/tasks/

2: https://github.com/tektoncd/community/blob/main/teps/0040-ignore-step-errors.md

3: https://github.com/tektoncd/pipeline/blob/main/docs/tasks.md

4: TEP-0040 implementation - specifying onError in a step tektoncd/pipeline#4106

5: https://github.com/tektoncd/pipeline/blob/main/docs/developers/taskruns.md

6: https://github.com/tektoncd/pipeline/blob/release-v1.3.x/examples/v1/taskruns/ignore-step-error.yaml

7: https://tekton.dev/docs/pipelines/debug/

8: https://github.com/tektoncd/pipeline/blob/main/docs/pipelines.md

Remove onError: continue from run-e2e-tests

onError: continue keeps the task/pipeline green even when exit $PIPELINE_EXIT is non-zero. destroy-rhelai already runs in finally, so this flag only masks e2e failures.
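If the flag were kept so that later steps in the same task can still run, the exit code recorded by Tekton could be re-raised in a final step instead of being dropped. A minimal sketch, assuming the exit code file holds the code as plain text; `propagate_exit_code` is a hypothetical helper name, and the file path is passed as an ordinary argument here rather than through Tekton's exitCode.path substitution:

```shell
#!/bin/bash
# Hypothetical helper: re-raise the exit code that an earlier step recorded.
# In a real Tekton step the path would come from the exitCode.path variable
# substitution; here it is a plain argument so the sketch can run anywhere.
propagate_exit_code() {
  local code_file="$1"
  local code

  # A missing file means the earlier step left no record; treat that as failure.
  if [[ ! -r "$code_file" ]]; then
    echo "exit code file not found: $code_file" >&2
    return 1
  fi

  # Strip whitespace so a trailing newline does not break the numeric check.
  code="$(tr -d '[:space:]' < "$code_file")"

  # Reject anything that is not a plain non-negative integer.
  if [[ ! "$code" =~ ^[0-9]+$ ]]; then
    echo "unexpected exit code content: $code" >&2
    return 1
  fi

  return "$code"
}
```

Removing the flag, as suggested above, remains the simpler option when nothing else in the task needs to run after the tests.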

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around lines 348 - 349, Remove the onError: continue setting from the run-e2e-tests task so failures are not masked when PIPELINE_EXIT is non-zero. Update the task definition in lightspeed-stack-rhelai-test.yaml for run-e2e-tests, and keep destroy-rhelai in finally as the cleanup path so the pipeline correctly fails on e2e errors.

coderabbitai · 2026-06-30T09:34:13Z

+              REPO_URL=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.url // "https://github.com/lightspeed-core/lightspeed-stack.git"' <<< "$SNAPSHOT")
+              REPO_REV=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.revision // "main"' <<< "$SNAPSHOT")
+              # TODO: remove branch override once merged to main
+              REPO_URL="https://github.com/are-ces/lightspeed-stack.git"
+              REPO_REV="rhelai-konflux"


🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Hardcoded fork/branch override ignores SNAPSHOT and tests unintended code.

Lines 398–399 unconditionally reset REPO_URL/REPO_REV to a personal fork (are-ces/lightspeed-stack.git @ rhelai-konflux), discarding the values just derived from $SNAPSHOT (Lines 395–396). As written, the pipeline always tests the fork rather than the component under test, which must be reverted before merge per the existing TODO.

Would you like me to open an issue to track removing this override before merge?
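One way to keep a temporary override without discarding the SNAPSHOT values is to gate it behind explicit opt-in variables. The following is a sketch only: `resolve_repo`, `REPO_URL_OVERRIDE`, and `REPO_REV_OVERRIDE` are hypothetical names and are not parameters of the pipeline in this PR.

```shell
#!/bin/bash
# Hypothetical sketch: prefer SNAPSHOT-derived values and apply an override
# only when it is set explicitly. REPO_URL_OVERRIDE and REPO_REV_OVERRIDE
# are illustrative names, not existing pipeline parameters.
resolve_repo() {
  local snapshot_url="$1"
  local snapshot_rev="$2"

  # ${VAR:-default} falls back to the snapshot value when the override is
  # unset or empty, so the default path always tests the component under test.
  REPO_URL="${REPO_URL_OVERRIDE:-$snapshot_url}"
  REPO_REV="${REPO_REV_OVERRIDE:-$snapshot_rev}"

  # Make an active override visible in the logs.
  if [[ -n "${REPO_URL_OVERRIDE:-}" || -n "${REPO_REV_OVERRIDE:-}" ]]; then
    echo "WARNING: testing override ${REPO_URL}@${REPO_REV}, not the SNAPSHOT component" >&2
  fi
}
```

With neither variable set, `resolve_repo "$url" "$rev"` leaves the jq-derived values in place; setting both to the fork and branch reproduces the current behavior, but only on request.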

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around lines 395 - 399, The lightspeed-stack repo/revision selection in the Tekton test script is being overwritten by a hardcoded fork and branch, so the pipeline ignores the SNAPSHOT-derived values. Remove the unconditional REPO_URL and REPO_REV reassignment in the lightspeed-stack test step and keep using the values parsed from SNAPSHOT in that block, leaving any temporary override behind the existing TODO only if it is explicitly gated for local use. Reference the REPO_URL and REPO_REV assignments in the lightspeed-stack pipeline step when updating this logic.

coderabbitai · 2026-06-30T09:34:13Z

 [[ -n "$QUAY_ROBOT_NAME" ]] && log "✅ QUAY_ROBOT_NAME is set" || { echo "❌ Missing QUAY_ROBOT_NAME"; exit 1; }
 [[ -n "$QUAY_ROBOT_PASSWORD" ]] && log "✅ QUAY_ROBOT_PASSWORD is set" || { echo "❌ Missing QUAY_ROBOT_PASSWORD"; exit 1; }
-[[ -n "$OPENAI_API_KEY" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
+[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

&& log || { exit 1 } can falsely fail when QUIET=1.

log() returns the status of [ "$QUIET" != "1" ], which is non-zero whenever QUIET=1. In that case the && log "…" arm returns non-zero and the || arm executes, printing ❌ Missing OPENAI_API_KEY and exiting 1 even though the key is set. Use an explicit if to decouple validation from the side-effecting log.

🐛 Proposed fix

-[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
+if [[ -n "${OPENAI_API_KEY:-}" ]]; then
+  log "✅ OPENAI_API_KEY is set"
+else
+  echo "❌ Missing OPENAI_API_KEY"; exit 1
+fi
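The QUAY_ROBOT_NAME and QUAY_ROBOT_PASSWORD checks in the same hunk use the same `&& log || { ...; exit 1; }` pattern, so the fix could be generalized into one helper. This is a sketch under assumptions: `require_var` is a hypothetical name, and `log()` is reconstructed from the description above (it returns the status of the QUIET test), not copied from pipeline-konflux.sh.

```shell
#!/bin/bash
# Assumed shape of log(), based on the review's description: it returns the
# status of the QUIET test, so it is non-zero whenever QUIET=1.
log() { [ "${QUIET:-0}" != "1" ] && echo "$@"; }

# Hypothetical helper: validate one variable by name. The result depends only
# on whether the variable is non-empty, never on log()'s return status.
require_var() {
  local name="$1"

  # ${!name:-} is bash indirect expansion; the :- default keeps it safe
  # under `set -u` when the variable is unset.
  if [[ -n "${!name:-}" ]]; then
    log "✅ $name is set" || true
    return 0
  fi

  echo "❌ Missing $name"
  return 1
}
```

A caller would then loop over the names, for example `for v in QUAY_ROBOT_NAME QUAY_ROBOT_PASSWORD OPENAI_API_KEY; do require_var "$v" || exit 1; done`.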


🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/e2e-prow/rhoai/pipeline-konflux.sh` at line 54, The OPENAI_API_KEY check in pipeline-konflux.sh is incorrectly tied to log()’s return value, so `QUIET=1` can trigger the failure path even when the key exists. Update the validation near the OPENAI_API_KEY guard to use an explicit conditional instead of `&& ... || ...`, and keep the existence check separate from the `log` side effect so `log()` cannot influence the exit behavior.

- Update run-rhelai.yaml: use base_url, VLLM_* env vars, restore comments
- Add lightspeed-stack-rhelai.yaml: LCS config with vllm provider
- Sync examples/vllm-rhelai.yaml with test config
- Parameterize pipeline-konflux.sh for LLAMA_STACK_CONFIG, LCS_CONFIG, VLLM_URL, VLLM_MODEL, VLLM_API_KEY
- Add optional VLLM env vars to pod manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provisions a RHEL AI GPU instance via MAPT and an ephemeral OpenShift cluster, deploys lightspeed-stack with vLLM as inference provider, and runs the full behave e2e test suite.

- OIDC federation for AWS auth (no static keys)
- On-demand with region fallback (spot available via param)
- Per-run S3 state isolation using PipelineRun name
- Random API key per run for vLLM authentication
- Tool calling via --vllm-extra-args
- RHEL AI 3.4.0 GA, Llama-3.1-8B-Instruct, 131072 context window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document MAPT usage, S3 state bucket, instance provisioning, GPU requirements, and AMI version management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot reviewed Jun 30, 2026

are-ces marked this pull request as draft June 30, 2026 10:38

are-ces and others added 3 commits June 30, 2026 14:54

LCORE-1724: add integration tests README

9f01d85

are-ces force-pushed the rhelai-konflux branch from 6234fe4 to 9f01d85 Compare June 30, 2026 12:55

are-ces wants to merge 3 commits into lightspeed-core:main from are-ces:rhelai-konflux
