
LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI#2028

Draft
are-ces wants to merge 3 commits into lightspeed-core:main from are-ces:rhelai-konflux


@are-ces are-ces commented Jun 30, 2026

Contributor

Description

Add a Konflux Tekton pipeline for running the full e2e test suite against RHEL AI instances provisioned on AWS. The pipeline uses MAPT to provision a GPU instance with vLLM (RHAIIS) auto-started, then deploys and tests lightspeed-stack as in the existing Konflux integration tests but configured to use the RHEL AI vLLM as its inference provider.

The pipeline provisions instances with 96GB+ total VRAM (4x GPU) because the e2e tests require a 131072-token context window: some test requests exceed 65K tokens and fail with a smaller context. Single-GPU instances (24GB) cannot fit both the model weights and the required KV cache.
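
The sizing argument above can be checked with back-of-the-envelope arithmetic. The sketch below is not part of the PR; it assumes the published Llama-3.1-8B architecture (32 layers, 8 KV heads, head dimension 128), 16-bit weights and KV cache, and roughly 8.03 billion parameters. It ignores activations and vLLM's own memory headroom, so the totals are a lower bound.

```shell
#!/usr/bin/env bash
# Rough memory estimate for one full-length sequence.
# All model constants are assumptions (Llama-3.1-8B, 16-bit weights and KV cache).
layers=32
kv_heads=8
head_dim=128
bytes_per_elem=2
ctx=131072                 # --max-model-len used by the pipeline
params=8030000000          # approximate parameter count (assumption)
gpu_gib=24                 # single-GPU VRAM quoted in the description

# Keys and values (factor 2) for every layer and KV head, per token.
kv_per_token=$(( 2 * layers * kv_heads * head_dim * bytes_per_elem ))
kv_gib=$(( kv_per_token * ctx / 1024 / 1024 / 1024 ))
weights_gib=$(( params * bytes_per_elem / 1024 / 1024 / 1024 ))
total_gib=$(( (kv_per_token * ctx + params * bytes_per_elem) / 1024 / 1024 / 1024 ))

echo "KV cache per token: ${kv_per_token} bytes"
echo "KV cache at ${ctx} tokens: ${kv_gib} GiB"
echo "Weights: about ${weights_gib} GiB"
echo "Total: about ${total_gib} GiB vs ${gpu_gib} GiB on a single GPU"
```

Under these assumptions a single 131072-token sequence needs 16 GiB of KV cache on top of the weights, which already exceeds 24 GiB, while a 4-GPU instance leaves room for concurrent requests.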

Key features:

  • RHEL AI provisioning via MAPT with auto-start, tool calling, and configurable context window
  • Spot/on-demand toggle with multi-instance-type fallback (g5.12xlarge, g6.12xlarge, g5.24xlarge, g6.24xlarge)
  • On-demand mode retries across 6 AWS regions with 10-minute timeout per attempt
  • Per-run S3 state isolation using PipelineRun name (no concurrent run conflicts)
  • Random API key per run for vLLM authentication
  • Parameterized pipeline-konflux.sh to support both OpenAI and vLLM inference providers
  • Integration tests README documenting MAPT, S3 bucket, provisioning modes, and AMI versioning
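
As an illustration of the per-run isolation and per-run API key listed above, here is a minimal sketch. The `s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}` layout comes from the pipeline script quoted later in this review; `PIPELINE_RUN_NAME`, the default values, and the key-generation command are assumptions made for the example and may differ from what the PR does.

```shell
#!/usr/bin/env bash
# Hypothetical inputs; in Konflux these would come from the PipelineRun context.
BUCKET="${BUCKET:-example-mapt-state}"
PIPELINE_RUN_NAME="${PIPELINE_RUN_NAME:-rhelai-e2e-abc12}"

# One state prefix per run, so concurrent runs never share Pulumi state.
RUN_ID="${PIPELINE_RUN_NAME}"
BACKED_URL="s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}"

# 32 random bytes rendered as 64 hex characters (one possible way to do it).
VLLM_API_KEY="$(head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n')"

echo "state prefix: ${BACKED_URL}"
echo "api key length: ${#VLLM_API_KEY}"   # print the length only, never the key
```

Keying the prefix on the run name also makes cleanup straightforward, since the destroy step can target the same prefix.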

New/modified files:

  • tests/e2e/configs/run-rhelai.yaml — Llama Stack config with remote::vllm provider
  • tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml — LCS config with vllm as default provider
  • tests/e2e-prow/rhoai/pipeline-konflux.sh — parameterized for VLLM_URL, VLLM_MODEL, VLLM_API_KEY
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml — optional vLLM env vars
  • tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml — optional VLLM_MODEL env var
  • .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml — full pipeline
  • .tekton/integration-tests/README.md — documentation

Type of change

  • Refactor
  • New feature
  • Bug fix
  • CVE fix
  • Optimization
  • Documentation Update
  • Configuration Update
  • Bump-up service version
  • Bump-up dependent library
  • Bump-up library or tool used for development (does not change the final image)
  • CI configuration change
  • Konflux configuration change
  • Unit tests improvement
  • Integration tests improvement
  • End to end tests improvement
  • Benchmarks improvement

Tools used to create PR

  • Assisted-by: Claude Opus 4.6
  • Generated by: N/A

Related Tickets & Documents

Checklist before requesting a review

  • I have performed a self-review of my code.
  • PR has passed all pre-merge test jobs.
  • If it is a core feature, I have added thorough tests.

Testing

  1. Pipeline tested end-to-end in Konflux with RHEL AI 3.4.0 GA on g5.12xlarge
  2. 270/276 e2e scenarios pass (6 failures due to model behavior differences between Llama-3.1-8B and gpt-4o-mini)
  3. Spot and on-demand provisioning validated locally and in Konflux
  4. Per-run S3 isolation verified with concurrent pipeline runs

Summary by CodeRabbit

  • New Features

    • Added a new Konflux integration test setup for running end-to-end checks against both OpenAI and RHEL AI/vLLM.
    • Introduced support for configurable vLLM-based inference, including model, URL, and API key settings.
    • Added a new server-mode configuration for Lightspeed Core Service with external llama-stack connectivity and RAG support.
  • Bug Fixes

    • Made several secret and environment settings optional to better support different test environments.
    • Improved test cleanup so temporary cloud resources are removed even if a run fails.


coderabbitai Bot commented Jun 30, 2026

Contributor

Review Change Stack

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 730ed797-7a90-4aca-88e0-b2dcbb5afeea

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Walkthrough

Adds a new Tekton pipeline (lightspeed-stack-rhelai-tests-pipeline) for RHEL AI vLLM Konflux integration tests. The pipeline provisions a GPU-backed AWS RHEL AI instance via MAPT, creates an ephemeral OpenShift cluster, and runs E2E tests. Supporting changes add vLLM environment wiring to pod manifests, update run-rhelai.yaml to use env-sourced vLLM config, introduce a new LCS server-mode config, update pipeline-konflux.sh for conditional vLLM secret/config handling, and add a README.

Changes

RHEL AI Konflux Integration Pipeline

| Layer | File(s) | Summary |
|---|---|---|
| RHEL AI run config and LCS server-mode config | tests/e2e/configs/run-rhelai.yaml, tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml | run-rhelai.yaml updated to use env.VLLM_URL, env.VLLM_API_KEY, and env.VLLM_MODEL for provider wiring and model registration; fixed openai allowed_models. New lightspeed-stack-rhelai.yaml adds a full LCS server-mode config with vLLM inference defaults and FAISS RAG. |
| Pipeline script and manifest vLLM env wiring | tests/e2e-prow/rhoai/pipeline-konflux.sh, tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml, tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml | pipeline-konflux.sh conditionally creates vLLM Kubernetes secrets, uses configurable LLAMA_STACK_CONFIG/LCS_CONFIG paths for ConfigMaps, and selects vllm provider/model defaults when VLLM_URL is set. Pod manifests add optional VLLM_URL, VLLM_API_KEY, and VLLM_MODEL env vars sourced from secrets. |
| Tekton RHEL AI pipeline definition | .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml | Defines lightspeed-stack-rhelai-tests-pipeline with provision-rhelai (spot/on-demand AWS instance, vLLM API key generation), eaas-provision-space, provision-cluster, get-stack-images, rhelai-e2e-tests (runs pipeline-konflux.sh), and a finally destroy-rhelai cleanup task. |
| Integration tests README | .tekton/integration-tests/README.md | Documents available E2E pipelines, MAPT provisioning, S3 Pulumi state lifecycle/prefix isolation, spot vs on-demand behavior, default model assumptions, and AMI version selection. |
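
The conditional provider selection summarized above could look roughly like the sketch below. It is an illustration, not the code in pipeline-konflux.sh: the `select_provider` and `select_llama_stack_config` helpers and the OpenAI config path are hypothetical, and only the `VLLM_URL` switch and the run-rhelai.yaml path come from the PR description.

```shell
#!/usr/bin/env bash
# Pick the inference provider from whether a vLLM endpoint was supplied.
select_provider() {
  local vllm_url="${1:-}"
  if [ -n "$vllm_url" ]; then
    echo "vllm"
  else
    echo "openai"
  fi
}

# Pick the Llama Stack config that matches the provider.
select_llama_stack_config() {
  case "$1" in
    vllm)   echo "tests/e2e/configs/run-rhelai.yaml" ;;
    openai) echo "tests/e2e/configs/run-openai-example.yaml" ;;  # hypothetical path
  esac
}

PROVIDER="$(select_provider "${VLLM_URL:-}")"
LLAMA_STACK_CONFIG="$(select_llama_stack_config "$PROVIDER")"
echo "provider=${PROVIDER} config=${LLAMA_STACK_CONFIG}"
```

Keeping the decision in one place means the secret creation and ConfigMap steps can branch on `PROVIDER` instead of re-testing `VLLM_URL` in several spots.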

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

Review effort 2/5

Suggested reviewers

  • radofuchs
  • tisnik
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly reflects the main change: adding a reliable CI workflow for deploying RHEL AI instances. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml:
- Around line 395-399: The lightspeed-stack repo/revision selection in the
Tekton test script is being overwritten by a hardcoded fork and branch, so the
pipeline ignores the SNAPSHOT-derived values. Remove the unconditional REPO_URL
and REPO_REV reassignment in the lightspeed-stack test step and keep using the
values parsed from SNAPSHOT in that block, leaving any temporary override behind
the existing TODO only if it is explicitly gated for local use. Reference the
REPO_URL and REPO_REV assignments in the lightspeed-stack pipeline step when
updating this logic.
- Around line 112-171: In the spot provisioning path of the Tekton step, the
exit status of `mapt aws rhel-ai create` is not checked, so failures can fall
through and emit empty results. Update the spot branch in the shell block to
guard the `mapt aws rhel-ai create` call the same way the on-demand path uses
`CREATED`, and fail fast with a clear error if creation does not succeed. Keep
the fix localized around the existing `if [[ "$(params.spot)" == "true" ]]`
branch and the subsequent result-writing commands so `host` and `vllm-api-key`
are only written after a successful create.
- Around line 348-349: Remove the onError: continue setting from the
run-e2e-tests task so failures are not masked when PIPELINE_EXIT is non-zero.
Update the task definition in lightspeed-stack-rhelai-test.yaml for
run-e2e-tests, and keep destroy-rhelai in finally as the cleanup path so the
pipeline correctly fails on e2e errors.

In `@tests/e2e-prow/rhoai/pipeline-konflux.sh`:
- Line 54: The OPENAI_API_KEY check in pipeline-konflux.sh is incorrectly tied
to log()’s return value, so `QUIET=1` can trigger the failure path even when the
key exists. Update the validation near the OPENAI_API_KEY guard to use an
explicit conditional instead of `&& ... || ...`, and keep the existence check
separate from the `log` side effect so `log()` cannot influence the exit
behavior.
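
The `&& ... || ...` pitfall described in the last finding can be reproduced in isolation. The sketch below uses a hypothetical `log` helper that stays silent when `QUIET=1`; the real `log()` in pipeline-konflux.sh may be written differently, so this only illustrates the failure mode and the explicit-conditional fix.

```shell
#!/usr/bin/env bash
# Hypothetical log helper: prints unless QUIET=1. Because of the `&&`
# short-circuit it returns non-zero whenever it stays silent.
log() { [ "${QUIET:-0}" != "1" ] && echo "$*"; }

QUIET=1
OPENAI_API_KEY="sk-placeholder"   # placeholder value for the demo

# Fragile: the `||` branch fires if EITHER the check OR log() fails.
fragile="ok"
[ -n "${OPENAI_API_KEY}" ] && log "OPENAI_API_KEY is set" || fragile="failed"

# Explicit: only the existence check decides the outcome.
robust="ok"
if [ -z "${OPENAI_API_KEY}" ]; then
  robust="failed"
else
  log "OPENAI_API_KEY is set" || true
fi

echo "fragile=${fragile} robust=${robust}"   # prints: fragile=failed robust=ok
```

The key is present in both cases; only the fragile form reports a failure, because the status of `log` leaks into the `||` branch.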

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: aa3c3a80-cd0a-4ff3-8b89-20dfd359c0dd

📥 Commits

Reviewing files that changed from the base of the PR and between 8efa018 and cb7ad02.

📒 Files selected for processing (7)
  • .tekton/integration-tests/README.md
  • .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e-prow/rhoai/pipeline-konflux.sh
  • tests/e2e/configs/run-rhelai.yaml
  • tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml
📜 Review details
⏰ Context from checks skipped due to timeout. (12)
  • GitHub Check: build-pr
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-0-6-on-pull-request
  • GitHub Check: Konflux kflux-prd-rh02 / lightspeed-stack-on-pull-request
  • GitHub Check: E2E: server mode / ci / group 1
  • GitHub Check: E2E: library mode / ci / group 2
  • GitHub Check: E2E: library mode / ci / group 1
  • GitHub Check: E2E: server mode / ci / group 2
  • GitHub Check: E2E: server mode / ci / group 3
  • GitHub Check: E2E: library mode / ci / group 3
  • GitHub Check: E2E Tests for Lightspeed Evaluation job
  • GitHub Check: integration_tests (3.12)
  • GitHub Check: integration_tests (3.13)
⚠️ CI failures not shown inline (4)

GitHub Actions: OpenAPI (Spectral) / 0_spectral.txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 set -euo pipefail
 uv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json
 if ! diff -u docs/openapi.json /tmp/openapi-generated.json; then
   echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"

GitHub Actions: OpenAPI (Spectral) / spectral: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run set -euo pipefail
 set -euo pipefail
 uv run python scripts/generate_openapi_schema.py /tmp/openapi-generated.json
 if ! diff -u docs/openapi.json /tmp/openapi-generated.json; then
   echo "::error::docs/openapi.json is out of date. Regenerate with: uv run scripts/generate_openapi_schema.py docs/openapi.json"

GitHub Actions: Unit tests / 1_unit_tests (3.13).txt: LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 shell: /usr/bin/bash -e {0}
 env:
   UV_PYTHON: 3.13
   VIRTUAL_ENV: /home/runner/work/lightspeed-stack/lightspeed-stack/.venv
   UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
 ##[endgroup]
 Uninstalled 1 package in 3ms
 Installed 1 package in 3ms
 ============================= test session starts ==============================
 platform linux -- Python 3.13.14, pytest-9.1.1, pluggy-1.6.0
 benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
 rootdir: /home/runner/work/lightspeed-stack/lightspeed-stack
 configfile: pyproject.toml
 plugins: asyncio-1.4.0, benchmark-5.2.3, anyio-4.14.1, order-1.5.0, mock-3.15.1, cov-7.1.0, logfire-4.37.0
 asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
 collected 2925 items
 tests/unit/a2a_storage/test_in_memory_context_store.py ........          [  0%]
 tests/unit/a2a_storage/test_sqlite_context_store.py ..........           [  0%]
 tests/unit/a2a_storage/test_storage_factory.py ...........               [  0%]
 tests/unit/app/endpoints/test_a2a.py ..............................      [  2%]
 tests/unit/app/endpoints/test_authorized.py ...                          [  2%]
 tests/unit/app/endpoints/test_config.py ..                               [  2%]
 tests/unit/app/endpoints/test_conversations.py ......................... [  3%]
 .................                                                        [  3%]
 tests/unit/app/endpoints/test_conversations_v2.py ...................... [  4%]
 ...............                                                          [  4%]
 tests/unit/app/endpoints/test_feedback.py .......................        [  5%]
 tests/unit/ap...

GitHub Actions: Unit tests / unit_tests (3.13): LCORE-1724: Establish a reliable method for deploying RHEL AI instances in CI

Conclusion: failure

View job details

##[group]Run uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 uv run pytest tests/unit --cov=src --cov=runner --cov-report term-missing
 shell: /usr/bin/bash -e {0}
 env:
   UV_PYTHON: 3.13
   VIRTUAL_ENV: /home/runner/work/lightspeed-stack/lightspeed-stack/.venv
   UV_CACHE_DIR: /home/runner/work/_temp/setup-uv-cache
 ##[endgroup]
 Uninstalled 1 package in 3ms
 Installed 1 package in 3ms
 ============================= test session starts ==============================
 platform linux -- Python 3.13.14, pytest-9.1.1, pluggy-1.6.0
 benchmark: 5.2.3 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
 rootdir: /home/runner/work/lightspeed-stack/lightspeed-stack
 configfile: pyproject.toml
 plugins: asyncio-1.4.0, benchmark-5.2.3, anyio-4.14.1, order-1.5.0, mock-3.15.1, cov-7.1.0, logfire-4.37.0
 asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
 collected 2925 items
 tests/unit/a2a_storage/test_in_memory_context_store.py ........          [  0%]
 tests/unit/a2a_storage/test_sqlite_context_store.py ..........           [  0%]
 tests/unit/a2a_storage/test_storage_factory.py ...........               [  0%]
 tests/unit/app/endpoints/test_a2a.py ..............................      [  2%]
 tests/unit/app/endpoints/test_authorized.py ...                          [  2%]
 tests/unit/app/endpoints/test_config.py ..                               [  2%]
 tests/unit/app/endpoints/test_conversations.py ......................... [  3%]
 .................                                                        [  3%]
 tests/unit/app/endpoints/test_conversations_v2.py ...................... [  4%]
 ...............                                                          [  4%]
 tests/unit/app/endpoints/test_feedback.py .......................        [  5%]
 tests/unit/ap...
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2026-02-19T10:06:50.647Z
Learnt from: radofuchs
Repo: lightspeed-core/lightspeed-stack PR: 1181
File: tests/e2e-prow/rhoai/manifests/lightspeed/mock-jwks.yaml:32-34
Timestamp: 2026-02-19T10:06:50.647Z
Learning: In the rhoai tests under tests/e2e-prow/rhoai/manifests, avoid static ConfigMap definitions for mock-jwks-script and mcp-mock-server-script since these ConfigMaps are created dynamically by the pipeline.sh deployment script using 'oc create configmap'. Ensure there are no static ConfigMap resources for these names in the manifests. If such ConfigMaps are added in the future, coordinate with the pipeline to reflect dynamic creation or adjust tests to rely on the dynamic provisioning.

Applied to files:

  • tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
📚 Learning: 2026-05-20T08:09:30.641Z
Learnt from: max-svistunov
Repo: lightspeed-core/lightspeed-stack PR: 1580
File: docs/design/llama-stack-config-merge/poc-results/library-mode/synthesized-run.yaml:107-110
Timestamp: 2026-05-20T08:09:30.641Z
Learning: In Llama-stack config YAMLs, when defining a Llama Guard safety shield entry, set `provider_shield_id` to the *guard model identifier* (e.g., `meta-llama/Llama-Guard-3-8B`). Do not use a chat/generative model id (e.g., `openai/gpt-4o-mini`): a chat-model id (or `native_override`) indicates only an override landed and does **not** mean the safety shield is actually gating queries. Ensure any E2E coverage for the related implementation (JIRA/E2E tests) exercises a real Llama Guard model to verify that the shield is effective.

Applied to files:

  • tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml
  • tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml
  • tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml
  • tests/e2e/configs/run-rhelai.yaml
🪛 markdownlint-cli2 (0.22.1)
.tekton/integration-tests/README.md

[warning] 38-38: Files should end with a single newline character

(MD047, single-trailing-newline)

🔇 Additional comments (6)
tests/e2e/configs/run-rhelai.yaml (1)

24-32: LGTM!

tests/e2e/configuration/server-mode/lightspeed-stack-rhelai.yaml (1)

21-35: LGTM!

tests/e2e-prow/rhoai/pipeline-konflux.sh (1)

75-79: LGTM!

Also applies to: 385-391

tests/e2e-prow/rhoai/manifests/lightspeed/lightspeed-stack.yaml (1)

33-38: LGTM!

tests/e2e-prow/rhoai/manifests/lightspeed/llama-stack-openai.yaml (1)

146-166: LGTM!

.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml (1)

90-91: 🩺 Stability & Availability

#!/bin/sh is fine for this image. ghcr.io/redhat-developer/mapt:pr-848 is based on UBI 9, so the shell supports the [[ ... ]] and pipefail usage here.

> Likely an incorrect or invalid review comment.

Comment on lines +112 to +171
if [[ "$(params.spot)" == "true" ]]; then
export AWS_DEFAULT_REGION="us-east-1"
echo "[mapt] Using spot instances (searching all regions)..."
mapt aws rhel-ai create \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--conn-details-output /opt/host-info \
--compute-sizes "$(params.instance-type)" \
--version "$(params.rhelai-version)" \
${SPOT_ARGS} \
--auto-start \
--model "$(params.model)" \
--hf-token "${HF_TOKEN}" \
--api-key "${VLLM_API_KEY}" \
--expose-ports 8000 \
--vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
--tags "project=lightspeed-core,environment=konflux-ci"
else
REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
TIMEOUT=600
CREATED=0

for REGION in $REGIONS; do
echo "[mapt] Trying on-demand in ${REGION}..."
export AWS_DEFAULT_REGION="$REGION"

if timeout $TIMEOUT mapt aws rhel-ai create \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--conn-details-output /opt/host-info \
--compute-sizes "$(params.instance-type)" \
--version "$(params.rhelai-version)" \
--auto-start \
--model "$(params.model)" \
--hf-token "${HF_TOKEN}" \
--api-key "${VLLM_API_KEY}" \
--expose-ports 8000 \
--vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
--tags "project=lightspeed-core,environment=konflux-ci"; then
CREATED=1
break
fi

echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
mapt aws rhel-ai destroy \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--force-destroy 2>/dev/null || true
done

if [ "$CREATED" -ne 1 ]; then
echo "[mapt] ERROR: Failed to create instance in any region"
exit 1
fi
fi

echo "[mapt] Instance created and vLLM started."
echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
echo -n "$(cat /opt/host-info/username)" > $(results.username.path)

Contributor

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Spot provisioning failures are not detected.

The script runs with set -uo pipefail (no -e). In the spot branch the exit status of mapt aws rhel-ai create is never checked, unlike the on-demand branch which uses the CREATED guard. If spot creation fails, execution falls through to Lines 168–171, where cat /opt/host-info/host fails (ignored, no -e) and empty host/vllm-api-key results are emitted, causing the e2e task to run against a non-existent endpoint instead of failing fast.
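
The fall-through described above can be shown with a stand-in command. In this sketch `create_instance` is a hypothetical placeholder for `mapt aws rhel-ai create` that always fails, and the surrounding function exists only so the example can be run without leaving the shell. As in the reviewed step, `-e` is not enabled, so only the explicit guard stops execution.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for `mapt aws rhel-ai create`; always fails here.
create_instance() { return 1; }

provision() {
  # Guard the create call explicitly; without -e nothing else would stop us.
  if ! create_instance; then
    echo "[mapt] ERROR: spot creation failed" >&2
    return 1
  fi
  # Reached only after a successful create, so results are never empty.
  echo "[mapt] Instance created and vLLM started."
}

provision_status=0
provision || provision_status=$?
echo "provision exit status: ${provision_status}"
```

In the real step the guard would end with `exit 1` rather than `return 1`, so the result files are never written for a failed create.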

🐛 Proposed fix (spot branch)
                 mapt aws rhel-ai create \
                     ...
-                    --tags "project=lightspeed-core,environment=konflux-ci"
+                    --tags "project=lightspeed-core,environment=konflux-ci" || {
+                      echo "[mapt] ERROR: spot creation failed"; exit 1; }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if [[ "$(params.spot)" == "true" ]]; then
export AWS_DEFAULT_REGION="us-east-1"
echo "[mapt] Using spot instances (searching all regions)..."
mapt aws rhel-ai create \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--conn-details-output /opt/host-info \
--compute-sizes "$(params.instance-type)" \
--version "$(params.rhelai-version)" \
${SPOT_ARGS} \
--auto-start \
--model "$(params.model)" \
--hf-token "${HF_TOKEN}" \
--api-key "${VLLM_API_KEY}" \
--expose-ports 8000 \
--vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
--tags "project=lightspeed-core,environment=konflux-ci" || {
echo "[mapt] ERROR: spot creation failed"; exit 1; }
else
REGIONS="us-east-1 us-east-2 us-west-2 eu-west-1 eu-central-1 ap-northeast-1"
TIMEOUT=600
CREATED=0
for REGION in $REGIONS; do
echo "[mapt] Trying on-demand in ${REGION}..."
export AWS_DEFAULT_REGION="$REGION"
if timeout $TIMEOUT mapt aws rhel-ai create \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--conn-details-output /opt/host-info \
--compute-sizes "$(params.instance-type)" \
--version "$(params.rhelai-version)" \
--auto-start \
--model "$(params.model)" \
--hf-token "${HF_TOKEN}" \
--api-key "${VLLM_API_KEY}" \
--expose-ports 8000 \
--vllm-extra-args "--max-model-len 131072 --enable-auto-tool-choice --tool-call-parser llama3_json --chat-template /opt/app-root/template/tool_chat_template_llama3.1_json.jinja" \
--tags "project=lightspeed-core,environment=konflux-ci"; then
CREATED=1
break
fi
echo "[mapt] Failed in ${REGION}, cleaning up and trying next..."
mapt aws rhel-ai destroy \
--project-name "mapt-rhel-ai-${RUN_ID}" \
--backed-url "s3://${BUCKET}/mapt/rhel-ai/${RUN_ID}" \
--force-destroy 2>/dev/null || true
done
if [ "$CREATED" -ne 1 ]; then
echo "[mapt] ERROR: Failed to create instance in any region"
exit 1
fi
fi
echo "[mapt] Instance created and vLLM started."
echo -n "${VLLM_API_KEY}" > $(results.vllm-api-key.path)
echo -n "$(cat /opt/host-info/host)" > $(results.host.path)
echo -n "$(cat /opt/host-info/username)" > $(results.username.path)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around
lines 112 - 171, In the spot provisioning path of the Tekton step, the exit
status of `mapt aws rhel-ai create` is not checked, so failures can fall through
and emit empty results. Update the spot branch in the shell block to guard the
`mapt aws rhel-ai create` call the same way the on-demand path uses `CREATED`,
and fail fast with a clear error if creation does not succeed. Keep the fix
localized around the existing `if [[ "$(params.spot)" == "true" ]]` branch and
the subsequent result-writing commands so `host` and `vllm-api-key` are only
written after a successful create.

Comment on lines +348 to +349
- name: run-e2e-tests
onError: continue

Contributor

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

git ls-files '.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml'
wc -l .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
sed -n '300,420p' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml
rg -n 'run-e2e-tests|destroy-rhelai|finally|onError|exit \$PIPELINE_EXIT' .tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml

Repository: lightspeed-core/lightspeed-stack

Length of output: 5730


🌐 Web query:

Tekton step onError continue semantics task pipeline run failure continue step exits non-zero

💡 Result:

In Tekton, the onError field allows you to control the behavior of a Task when a step exits with a non-zero exit code [1][2]. By default, if a step fails, the remaining steps in the Task are skipped and the TaskRun is marked as a failure [1][3]. When you set onError: continue for a step, the following semantics apply:

  1. Step Execution: If the step fails (i.e., exits with a non-zero code), Tekton does not stop the TaskRun [1][2]. Instead, it proceeds to execute the subsequent steps in the Task [1][4].
  2. Exit Code Visibility: Although the step failure is ignored for the purpose of halting the TaskRun, the original exit code is preserved [1][2]. You can access this exit code by reading the file specified by the path variable $(steps.step-<step-name>.exitCode.path) or $(steps.step-unnamed-<step-index>.exitCode.path) [5][6].
  3. Task Results: If a step fails but manages to initialize a result file before exiting, that result is made available to subsequent tasks [1][3]. However, if the step fails before creating a result that a downstream task expects, the pipeline may fail with an InvalidTaskResultReference error [3][2].
  4. Precedence: The onError behavior is overridden if a breakpoint is active for that step [1][4]. Debugging features allow you to manually dictate whether a step should be treated as a success or failure, which takes precedence over the onError configuration [1][7].

This functionality is specific to the step level within a Task [1][8]. It is distinct from the onError semantics available at the Pipeline level, which controls whether an entire PipelineTask failure stops the overall PipelineRun [8].

Citations:


Remove onError: continue from run-e2e-tests.

onError: continue keeps the task/pipeline green even when exit $PIPELINE_EXIT is non-zero. destroy-rhelai already runs in finally, so this flag only masks e2e failures.
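
One way to keep post-test work (such as collecting logs) without `onError: continue` is to capture the test exit status inside the step and return it at the end. The sketch below assumes hypothetical `run_tests` and `collect_artifacts` helpers; only the `PIPELINE_EXIT` name is taken from the review comment.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the real test run and artifact collection.
run_tests() { return 3; }
collect_artifacts() { echo "artifacts collected"; }

step() {
  local PIPELINE_EXIT=0
  # Remember the failure instead of aborting, so collection still happens.
  run_tests || PIPELINE_EXIT=$?
  collect_artifacts
  # Propagate the original status so the step (and the pipeline) fails.
  return "$PIPELINE_EXIT"
}

step_status=0
step || step_status=$?
echo "step exit status: ${step_status}"
```

With this pattern the step still reports failure, and the `finally` task remains the only cleanup path.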

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around
lines 348-349, remove the onError: continue setting from the run-e2e-tests
task so failures are not masked when PIPELINE_EXIT is non-zero. Update the task
definition in lightspeed-stack-rhelai-test.yaml for run-e2e-tests, and keep
destroy-rhelai in finally as the cleanup path so the pipeline correctly fails on
e2e errors.
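
The pattern this finding relies on can be sketched as follows. This is a minimal illustration, assuming the step stores the test status in PIPELINE_EXIT as the comment describes; run_tests, collect_artifacts, and run_step are hypothetical stand-ins and are not taken from the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real e2e invocation; it fails on purpose.
run_tests() { return 3; }

# Hypothetical stand-in for log/artifact collection that should always run.
collect_artifacts() { echo "artifacts collected"; }

run_step() {
  local PIPELINE_EXIT=0
  # Capture the test status instead of aborting before artifacts are saved.
  run_tests || PIPELINE_EXIT=$?
  collect_artifacts
  # Propagate the original status. Without onError: continue, a non-zero
  # value here marks the step, and therefore the TaskRun, as failed.
  return "$PIPELINE_EXIT"
}
```

In a real step script the last line would be exit "$PIPELINE_EXIT" rather than a function return. Cleanup stays in the pipeline's finally block, so it does not depend on the step succeeding.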

Comment on lines +395 to +399
REPO_URL=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.url // "https://github.com/lightspeed-core/lightspeed-stack.git"' <<< "$SNAPSHOT")
REPO_REV=$(jq -r '.components[] | select(.name == "lightspeed-stack") | .source.git.revision // "main"' <<< "$SNAPSHOT")
# TODO: remove branch override once merged to main
REPO_URL="https://github.com/are-ces/lightspeed-stack.git"
REPO_REV="rhelai-konflux"

🎯 Functional Correctness | 🔴 Critical | ⚡ Quick win

Hardcoded fork/branch override ignores SNAPSHOT and tests unintended code.

Lines 398–399 unconditionally reset REPO_URL/REPO_REV to a personal fork (are-ces/lightspeed-stack.git @ rhelai-konflux), discarding the values just derived from $SNAPSHOT (Lines 395–396). As written, the pipeline always tests the fork rather than the component under test, which must be reverted before merge per the existing TODO.

Would you like me to open an issue to track removing this override before merge?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.tekton/integration-tests/pipeline/lightspeed-stack-rhelai-test.yaml around
lines 395-399, the lightspeed-stack repo/revision selection in the Tekton test
script is being overwritten by a hardcoded fork and branch, so the pipeline
ignores the SNAPSHOT-derived values. Remove the unconditional REPO_URL and
REPO_REV reassignment in the lightspeed-stack test step and keep using the
values parsed from SNAPSHOT in that block, leaving any temporary override behind
the existing TODO only if it is explicitly gated for local use. Reference the
REPO_URL and REPO_REV assignments in the lightspeed-stack pipeline step when
updating this logic.
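
One way to gate the override explicitly is sketched below. This is only an illustration: pick_value, REPO_URL_OVERRIDE, REPO_REV_OVERRIDE, SNAPSHOT_URL, and SNAPSHOT_REV are assumed names that do not appear in the pipeline, and SNAPSHOT_URL / SNAPSHOT_REV stand in for the values the step already parses from $SNAPSHOT with jq.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer an explicit override, then the SNAPSHOT value,
# then a default. Arguments: <override> <snapshot_value> <default>.
pick_value() {
  local override="$1" snapshot_value="$2" default="$3"
  if [[ -n "$override" ]]; then
    echo "$override"
  elif [[ -n "$snapshot_value" && "$snapshot_value" != "null" ]]; then
    echo "$snapshot_value"
  else
    echo "$default"
  fi
}

# The *_OVERRIDE variables are assumed names and would stay unset in CI,
# so the SNAPSHOT-derived values win unless someone opts in locally.
REPO_URL=$(pick_value "${REPO_URL_OVERRIDE:-}" "${SNAPSHOT_URL:-}" \
  "https://github.com/lightspeed-core/lightspeed-stack.git")
REPO_REV=$(pick_value "${REPO_REV_OVERRIDE:-}" "${SNAPSHOT_REV:-}" "main")
echo "Testing $REPO_URL @ $REPO_REV"
```

With this shape, testing a fork becomes an opt-in (for example, exporting REPO_REV_OVERRIDE=rhelai-konflux in a local run) instead of an unconditional reassignment that has to be remembered and removed before merge.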

 [[ -n "$QUAY_ROBOT_NAME" ]] && log "✅ QUAY_ROBOT_NAME is set" || { echo "❌ Missing QUAY_ROBOT_NAME"; exit 1; }
 [[ -n "$QUAY_ROBOT_PASSWORD" ]] && log "✅ QUAY_ROBOT_PASSWORD is set" || { echo "❌ Missing QUAY_ROBOT_PASSWORD"; exit 1; }
-[[ -n "$OPENAI_API_KEY" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
+[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

&& log || { exit 1 } can falsely fail when QUIET=1.

log() returns the status of [ "$QUIET" != "1" ], which is non-zero whenever QUIET=1. In that case the && log "…" arm returns non-zero and the || arm executes, printing ❌ Missing OPENAI_API_KEY and exiting 1 even though the key is set. Use an explicit if to decouple validation from the side-effecting log.

🐛 Proposed fix
-[[ -n "${OPENAI_API_KEY:-}" ]] && log "✅ OPENAI_API_KEY is set" || { echo "❌ Missing OPENAI_API_KEY"; exit 1; }
+if [[ -n "${OPENAI_API_KEY:-}" ]]; then
+  log "✅ OPENAI_API_KEY is set"
+else
+  echo "❌ Missing OPENAI_API_KEY"; exit 1
+fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e-prow/rhoai/pipeline-konflux.sh` at line 54, the OPENAI_API_KEY
check in pipeline-konflux.sh is incorrectly tied to log()’s return value, so
`QUIET=1` can trigger the failure path even when the key exists. Update the
validation near the OPENAI_API_KEY guard to use an explicit conditional instead
of `&& ... || ...`, and keep the existence check separate from the `log` side
effect so `log()` cannot influence the exit behavior.
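
The failure mode can be reproduced in isolation with the sketch below. The log() definition is an assumption reconstructed from the description in this comment (it returns the status of the QUIET test); check_fragile and check_robust are hypothetical helper names used only for the demonstration.

```shell
#!/usr/bin/env bash
# Assumed definition, matching the description in the review comment:
# log() returns non-zero whenever QUIET=1 because the test itself fails.
log() { [ "${QUIET:-}" != "1" ] && echo "$@"; }

# Fragile form: a non-zero status from log falls through to the || arm,
# so a value that is set is still reported as missing when QUIET=1.
check_fragile() {
  [[ -n "${1:-}" ]] && log "set" || { echo "missing"; return 1; }
}

# Robust form: the branch taken depends only on the variable test.
check_robust() {
  if [[ -n "${1:-}" ]]; then
    log "set" || true   # sketch choice: ignore log's own status
  else
    echo "missing"
    return 1
  fi
}
```

Running both helpers with QUIET=1 and a non-empty argument shows the difference: check_fragile prints "missing" and returns 1, while check_robust prints nothing and returns 0. The || true is an addition in this sketch so the function's return status does not inherit log's status; it is not part of the proposed fix.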

are-ces marked this pull request as draft June 30, 2026 10:38
are-ces and others added 3 commits June 30, 2026 14:54
- Update run-rhelai.yaml: use base_url, VLLM_* env vars, restore comments
- Add lightspeed-stack-rhelai.yaml: LCS config with vllm provider
- Sync examples/vllm-rhelai.yaml with test config
- Parameterize pipeline-konflux.sh for LLAMA_STACK_CONFIG, LCS_CONFIG,
  VLLM_URL, VLLM_MODEL, VLLM_API_KEY
- Add optional VLLM env vars to pod manifests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Provisions a RHEL AI GPU instance via MAPT and an ephemeral OpenShift
cluster, deploys lightspeed-stack with vLLM as inference provider, and
runs the full behave e2e test suite.

- OIDC federation for AWS auth (no static keys)
- On-demand with region fallback (spot available via param)
- Per-run S3 state isolation using PipelineRun name
- Random API key per run for vLLM authentication
- Tool calling via --vllm-extra-args
- RHEL AI 3.4.0 GA, Llama-3.1-8B-Instruct, 131072 context window

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Document MAPT usage, S3 state bucket, instance provisioning,
GPU requirements, and AMI version management.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>