RAG Configuration Guide

This document explains how to configure and customize your RAG pipeline. You will:

Initialize a vector store
Download and point to a local embedding model
Configure an inference provider (LLM)
Choose a RAG strategy (Inline RAG or Tool RAG)

Introduction
Prerequisites
- Set Up the Vector Database
- Download an Embedding Model
Configure BYOK Knowledge Sources
Add an Inference Model (LLM)
Complete Configuration Reference
System Prompt Guidance for RAG (as a tool)
RAG annotations
References

Introduction

Lightspeed Core Stack (LCS) supports two complementary RAG strategies:

Inline RAG: context is fetched from BYOK vector stores and/or OKP and injected before the LLM request. No tool calls are required.
Tool RAG: the LLM can call the file_search tool during generation to retrieve context on demand from BYOK vector stores and/or OKP.

Both strategies can be enabled independently via the rag section of lightspeed-stack.yaml. See BYOK Feature Documentation for configuration details.

The Embedding Model is used to convert queries and documents into vector representations for similarity matching.

Note

The same Embedding Model should be used to both create the vector store and to query it.
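To illustrate why the note above matters, here is a minimal sketch of similarity matching on plain Python lists. The vectors are made-up stand-ins for real embeddings, and `cosine_similarity` is a hypothetical helper, not part of LCS. Vectors produced by different models may differ in size, and even when the sizes agree they do not share a vector space, so comparing them gives meaningless scores.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        # Different sizes usually mean two different embedding models were used.
        raise ValueError(f"dimension mismatch: {len(a)} != {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings"; all-mpnet-base-v2 emits 768 values per text.
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 1.1]
doc_far = [0.0, 1.0, 0.0]

# Rank the documents by similarity to the query, best match first.
ranked = sorted(
    [("doc_close", doc_close), ("doc_far", doc_far)],
    key=lambda item: cosine_similarity(query_vec, item[1]),
    reverse=True,
)
```

Here `ranked[0]` is the `doc_close` entry, since its direction is nearly the same as the query vector.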

Prerequisites

Set Up the Vector Database

Use the rag-content repository to build a compatible vector database.

Important

The resulting DB must be in a supported format (e.g., FAISS with SQLite metadata). This can be configured when using the tool to generate the index.

Download an Embedding Model

Download a local embedding model such as sentence-transformers/all-mpnet-base-v2 by using the script in rag-content, or download it manually and place it in your desired path.

Note

The embedding model can also be downloaded automatically at first start-up (which will be slower). In the byok_rag section of lightspeed-stack.yaml, specify a supported model name as embedding_model instead of a local path. The model will be downloaded to the ~/.cache/huggingface/hub folder.
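If you want to check whether the automatic download has already happened, the sketch below computes where a model would be expected under the hub cache folder mentioned above. It assumes the usual Hugging Face cache layout (a `models--<org>--<name>` directory per model) and the default cache location; `hf_cache_dir` and `is_model_cached` are hypothetical helper names, and a relocated cache (for example via environment settings) is not handled.

```python
from pathlib import Path


def hf_cache_dir(model_id, hub_root=None):
    """Expected cache directory for a model id such as 'org/name'."""
    if hub_root is not None:
        root = Path(hub_root)
    else:
        # Default location named in this guide.
        root = Path.home() / ".cache" / "huggingface" / "hub"
    # Assumption: the cache stores each model under 'models--org--name'.
    return root / ("models--" + model_id.replace("/", "--"))


def is_model_cached(model_id, hub_root=None):
    """True if the expected cache directory exists on disk."""
    return hf_cache_dir(model_id, hub_root).is_dir()
```

For example, `is_model_cached("sentence-transformers/all-mpnet-base-v2")` returning `False` suggests the first start-up will include a download.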

Configure BYOK Knowledge Sources

BYOK knowledge sources are configured in the byok_rag section of lightspeed-stack.yaml. The required configuration is automatically generated at startup when using make run, make run-stack, docker-compose, or library mode, so no manual enrichment is needed.

FAISS example

byok_rag:
  - rag_id: custom-index
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2  # or path to local model
    embedding_dimension: 768
    vector_db_id: vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2  # Generated by rag-content during index creation
    db_path: <path-to-vector-index>  # e.g. /home/USER/vector_db/faiss_store.db

Where:

embedding_model is the embedding model identifier or path to the local model folder
db_path is the path to the vector index (.db file in this case)
vector_db_id is the ID generated by rag-content during index creation (e.g. vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2)

See the full working config example for more details.
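Before starting the service, it can help to sanity-check a byok_rag entry. The following is a minimal sketch that works on an entry already loaded into a Python dict. The field names come from the FAISS example above; the choice of which fields to treat as required, and the helper name `check_byok_entry`, are assumptions for illustration and not the validation LCS itself performs.

```python
# Assumption: these fields are treated as mandatory for every entry.
REQUIRED_FIELDS = (
    "rag_id",
    "rag_type",
    "embedding_model",
    "embedding_dimension",
    "vector_db_id",
)


def check_byok_entry(entry):
    """Return a list of human-readable problems found in one byok_rag entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in entry]
    dim = entry.get("embedding_dimension")
    if dim is not None and (not isinstance(dim, int) or dim <= 0):
        problems.append("embedding_dimension must be a positive integer")
    # A FAISS store is a local file, so a path to it is expected.
    if entry.get("rag_type") == "inline::faiss" and not entry.get("db_path"):
        problems.append("inline::faiss entries need a db_path")
    return problems


example_entry = {
    "rag_id": "custom-index",
    "rag_type": "inline::faiss",
    "embedding_model": "sentence-transformers/all-mpnet-base-v2",
    "embedding_dimension": 768,
    "vector_db_id": "vs_8c94967b-81cc-4028-a294-9cfac6fd9ae2",
    "db_path": "/home/USER/vector_db/faiss_store.db",
}
```

An empty list from `check_byok_entry(example_entry)` means none of the sketched checks failed.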

pgvector example

This example shows how to configure a remote PostgreSQL database with the pgvector extension for storing embeddings.

You will need to install a PostgreSQL version that is compatible with pgvector, then log in with psql and enable the extension with:
CREATE EXTENSION IF NOT EXISTS vector;

Each pgvector-backed table follows this schema:

id (text): UUID identifier of the chunk
document (jsonb): JSON containing the content and metadata associated with the embedding
embedding (vector(n)): the embedding vector, where n is the embedding dimension and will match the model's output size (e.g. 768 for all-mpnet-base-v2)

Note

The vector_db_id (e.g. rhdocs) is used to point to the table named vector_store_rhdocs in the specified database, which stores the vector embeddings.

byok_rag:
  - rag_id: pgvector-example
    rag_type: remote::pgvector
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: rhdocs  # becomes PostgreSQL table 'vector_store_rhdocs'
    host: ${env.POSTGRES_HOST}
    port: ${env.POSTGRES_PORT}
    db: ${env.POSTGRES_DATABASE}
    user: ${env.POSTGRES_USER}
    password: ${env.POSTGRES_PASSWORD}

Note

Connection fields (host, port, db, user, password) default to ${env.POSTGRES_*} environment variable references when omitted. Use environment variables for credentials.
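The table naming rule and schema described above can be sketched as follows. This is only an illustration for inspecting or reasoning about the expected table: the three columns come from the schema listed in this section, while any keys, constraints, or indexes used by the real tooling are not shown because they are not documented here. The helper names and the identifier check are assumptions.

```python
import re


def pgvector_table_name(vector_db_id):
    """Table name derived from vector_db_id, e.g. 'rhdocs' -> 'vector_store_rhdocs'."""
    # Assumption: restrict ids to plain identifiers so they are safe to embed in SQL.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", vector_db_id):
        raise ValueError(f"unsafe identifier: {vector_db_id!r}")
    return f"vector_store_{vector_db_id}"


def pgvector_table_ddl(vector_db_id, embedding_dimension):
    """CREATE TABLE statement mirroring the documented three-column schema."""
    table = pgvector_table_name(vector_db_id)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"    id TEXT,\n"
        f"    document JSONB,\n"
        f"    embedding vector({int(embedding_dimension)})\n"
        f");"
    )
```

Calling `pgvector_table_ddl("rhdocs", 768)` yields a statement for a table named vector_store_rhdocs with a 768-dimensional embedding column, matching all-mpnet-base-v2.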

Add an Inference Model (LLM)

vLLM on RHEL AI (Llama 3.1) example

Note

The following example assumes that podman's CDI has been properly configured to enable GPU support.

The vllm-openai Docker image is used to serve the Llama-3.1-8B-Instruct model.
The following example shows how to run it on RHEL AI with podman:

podman run \
  --device "${CONTAINER_DEVICE}" \
  --gpus ${GPUS} \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
  -p ${EXPORTED_PORT}:8000 \
  --ipc=host \
  docker.io/vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --chat-template examples/tool_chat_template_llama3.1_json.jinja

The example command above enables tool calling for Llama 3.1 models. For other supported models and configuration options, see the Tool Calling page of the vLLM documentation.

After starting the container, configure the vLLM provider in your run.yaml, matching model_id with the model provided in the podman run command.

[...]
models:
[...]
- model_id: meta-llama/Llama-3.1-8B-Instruct # Same as the model name in the 'podman run' command
  provider_id: vllm
  model_type: llm
  provider_model_id: null

providers:
  [...]
  inference:
  - provider_id: vllm
    provider_type: remote::vllm
    config:
      url: http://localhost:${env.EXPORTED_PORT:=8000}/v1/ # Replace localhost with the url of the vLLM instance
      api_token: <your-key-here> # if any

OpenAI example

Add a provider for your language model in your run.yaml (e.g., OpenAI):

models:
[...]
- model_id: my-model 
  provider_id: openai
  model_type: llm
  provider_model_id: <model-name> # e.g. gpt-4o-mini

providers:
[...]
  inference:
  - provider_id: openai
    provider_type: remote::openai
    config:
      api_key: ${env.OPENAI_API_KEY}

Make sure to export your API key:

export OPENAI_API_KEY=<your-key-here>
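The configuration snippets in this guide use placeholders such as `${env.OPENAI_API_KEY}` and `${env.EXPORTED_PORT:=8000}`. As a rough mental model only, the sketch below resolves such placeholders against a mapping of environment variables. The real substitution is done by the stack itself; this resolver, its name, and its treatment of edge cases (for example, a variable that is set but empty counts as set here) are assumptions.

```python
import os
import re

# Matches ${env.NAME} and ${env.NAME:=default}.
_ENV_PATTERN = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_placeholders(text, environ=None):
    """Replace ${env.NAME} / ${env.NAME:=default} placeholders in text."""
    env = os.environ if environ is None else environ

    def _replace(match):
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        # No value and no default: fail loudly instead of leaving the placeholder.
        raise KeyError(f"environment variable {name} is not set")

    return _ENV_PATTERN.sub(_replace, text)
```

With an empty environment, the vLLM URL placeholder from the earlier example falls back to port 8000, while `${env.OPENAI_API_KEY}` raises an error, which mirrors why the export step above is needed.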

Note

When experimenting with different models, providers and vector_dbs, you might need to manually unregister the old ones via the CLI.

Azure OpenAI

Not yet supported.

Ollama

The remote::ollama provider does not support tool calling, so RAG as a tool is not available. However, inline RAG is supported.

vLLM Mistral

The RAG tool calls were not working properly when experimenting with mistralai/Mistral-7B-Instruct-v0.3 on vLLM.

OKP/Solr Vector IO

The OKP (Offline Knowledge Portal) Solr Vector IO is a read-only vector search provider that integrates with Apache Solr for enhanced vector search capabilities. It enables retrieving contextual information from Solr-indexed Red Hat documents to enhance query responses with support for hybrid search and chunk window expansion.

How to Enable OKP/Solr Vector IO

1. Configure Lightspeed Stack (lightspeed-stack.yaml):

rag:
  inline:
    - okp               # inject OKP context before the LLM request
  tool:
    - okp               # expose OKP as the file_search tool

okp:
  rhokp_url: ${env.RH_SERVER_OKP}   # OKP base URL (env var or literal URL)
  offline: true         # true = use parent_id for source URLs (offline mode)
                        # false = use reference_url (online mode)

Set rhokp_url to the base URL of your OKP server. Use ${env.RH_SERVER_OKP} to read the URL from the environment; when omitted or empty, a default from the application constants is used.

Note

When okp is listed in rag.inline or rag.tool, Lightspeed Stack automatically enriches the underlying configuration at startup with the required vector_io provider and registered_resources entries for the OKP vector store. No manual registration is needed.

Query Request Example:

curl -sX POST http://localhost:8080/v1/query \
    -H "Content-Type: application/json" \
    -d '{"query" : "how do I secure a nodejs application with keycloak?"}' | jq .
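The same request can be sent from Python using only the standard library. This sketch mirrors the curl command above (same endpoint, header, and body); the helper names are hypothetical, and the shape of the response is not assumed beyond it being JSON.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8080"):
    """Build the POST /v1/query request without sending it."""
    body = json.dumps({"query": question}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_query(question, base_url="http://localhost:8080", timeout=60):
    """Send the query and return the decoded JSON response."""
    request = build_query_request(question, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting the request construction from the network call makes it easy to inspect what will be sent before the service is running.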

Query Processing:

When OKP is enabled, queries use the portal-rag vector store
Vector search is performed with configurable parameters:
- k: Number of results (default: 5)
- score_threshold: Minimum similarity score (default: 0.0)
- mode: Search mode (default: "hybrid"). Per-request configurable.
Results include document metadata and source URLs
Document URLs are built based on the offline setting:
- Offline mode: Uses parent_id with Mimir base URL
- Online mode: Uses reference_url from document metadata
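The selection and URL rules listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the provider's implementation: results are assumed to be dicts with a `score` key, the threshold comparison is assumed to be inclusive, and joining the base URL and parent_id with a slash is a guess at how the offline URL is formed. The `base_url` parameter stands in for the Mimir base URL.

```python
def select_results(results, k=5, score_threshold=0.0):
    """Keep results at or above the threshold, best first, at most k of them."""
    kept = [r for r in results if r["score"] >= score_threshold]
    kept.sort(key=lambda r: r["score"], reverse=True)
    return kept[:k]


def source_url(metadata, offline, base_url):
    """Pick the source URL according to the offline setting."""
    if offline:
        # Assumption: the offline URL is the base URL followed by the parent_id.
        return base_url.rstrip("/") + "/" + str(metadata["parent_id"])
    return metadata["reference_url"]
```

For instance, with `k=2` and `score_threshold=0.5`, three results scored 0.9, 0.4, and 0.7 reduce to the 0.9 and 0.7 entries.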

Query Filtering:

To further filter the OKP context, set the chunk_filter_query field in the okp section of lightspeed-stack.yaml. Filters follow the OKP key:value format and are applied as a static fq parameter on every OKP search request.

okp:
  rhokp_url: ${env.RH_SERVER_OKP}
  chunk_filter_query: "product:*openshift*"

Per-request filtering is also available on all inference endpoints via the solr request field, which accepts mode (semantic, hybrid, or lexical) and filters (key:value format). Legacy payloads that omit mode/filters and send filter key:value pairs at the top level still work, with mode set to hybrid.

Example:

{
  "query": "How do I configure routes?",
  "solr": {
    "mode": "hybrid",
    "filters": { "fq": ["product:*openshift*"] }
  }
}
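A small builder can keep such payloads consistent. The sketch below produces the same structure as the JSON example above and rejects modes other than the three named in this section; `build_solr_query` is a hypothetical helper, and whether the server performs the same validation is not something this guide states.

```python
import json

# The three search modes named in this guide.
ALLOWED_MODES = ("semantic", "hybrid", "lexical")


def build_solr_query(question, mode="hybrid", filter_queries=None):
    """Build a query payload with a per-request 'solr' section."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    solr = {"mode": mode}
    if filter_queries:
        # Filters use the key:value format, passed as a list under 'fq'.
        solr["filters"] = {"fq": list(filter_queries)}
    return {"query": question, "solr": solr}


payload = build_solr_query("How do I configure routes?", filter_queries=["product:*openshift*"])
payload_json = json.dumps(payload, indent=2)
```

The resulting `payload_json` can be sent as the request body in place of the hand-written JSON.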

Prerequisites:

The OKP server must be running and accessible at the URL given in okp.rhokp_url (or ${env.RH_SERVER_OKP}). For instructions on how to pull and run the OKP image, visit: https://github.com/lightspeed-core/lightspeed-providers/lightspeed_stack_providers/providers/remote/solr_vector_io/solr_vector_io/README.md

Chunk volume:

OKP and BYOK scores are not directly comparable (different scoring systems), so score_multiplier (a BYOK-only concept) does not apply to OKP results. To control the number of retrieved chunks, set the constants in src/constants.py:

| Constant | Value | Description |
| --- | --- | --- |
| `INLINE_RAG_MAX_CHUNKS` | 10 | Hard upper bound on the final merged inline RAG chunks (BYOK + OKP) delivered to the LLM |
| `OKP_RAG_MAX_CHUNKS` | 5 | Fetch hint for OKP (Inline RAG); controls how many chunks enter the reranking pool |
| `BYOK_RAG_MAX_CHUNKS` | 10 | Fetch hint for BYOK stores (Inline RAG); controls how many chunks enter the reranking pool |
| `TOOL_RAG_MAX_CHUNKS` | 10 | Max chunks retrieved via Tool RAG (`file_search`); independent from `INLINE_RAG_MAX_CHUNKS` |
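How the inline constants interact can be sketched as a two-stage cap: each source contributes up to its fetch hint to a shared pool, and the pool is then reranked and cut to the hard upper bound. The values below are copied from the table; the function name, the chunk representation, and the caller-supplied `rerank_key` are assumptions, since the actual reranking logic is not described in this guide.

```python
INLINE_RAG_MAX_CHUNKS = 10
OKP_RAG_MAX_CHUNKS = 5
BYOK_RAG_MAX_CHUNKS = 10


def merge_inline_chunks(byok_chunks, okp_chunks, rerank_key):
    """Pool chunks from both sources, rerank, and apply the final cap."""
    # Stage 1: each source contributes at most its fetch hint (up to 15 in total).
    pool = byok_chunks[:BYOK_RAG_MAX_CHUNKS] + okp_chunks[:OKP_RAG_MAX_CHUNKS]
    # Stage 2: rerank with a key that is meaningful across both sources.
    ranked = sorted(pool, key=rerank_key, reverse=True)
    # Stage 3: never deliver more than the hard upper bound to the LLM.
    return ranked[:INLINE_RAG_MAX_CHUNKS]


# Toy data: 12 BYOK chunks and 8 OKP chunks with a made-up 'rank' value.
byok = [{"id": f"b{i}", "rank": i} for i in range(12)]
okp = [{"id": f"o{i}", "rank": 100 + i} for i in range(8)]
merged = merge_inline_chunks(byok, okp, rerank_key=lambda chunk: chunk["rank"])
```

With these defaults the pool holds at most 15 chunks, of which at most 10 reach the LLM.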

Limitations:

This is a read-only provider: insert and delete operations are not supported.

Complete Configuration Reference

To enable RAG functionality, configure the byok_rag and rag sections in your lightspeed-stack.yaml.

Below is an example of a working lightspeed-stack.yaml configuration with:

A local all-mpnet-base-v2 embedding model
A FAISS-based vector store
Inline and Tool RAG enabled

Tip

We recommend starting with a minimal working configuration and extending it as needed.

name: Lightspeed Core Service (LCS)
service:
  host: localhost
  port: 8080
  auth_enabled: false

byok_rag:
  - rag_id: ocp-docs
    rag_type: inline::faiss
    embedding_model: sentence-transformers/all-mpnet-base-v2
    embedding_dimension: 768
    vector_db_id: vs_3a7f9b2e-45dc-4e1a-b8f2-1c9d0e3f5a6b
    db_path: /home/USER/lightspeed-stack/vector_dbs/ocp_docs/faiss_store.db

rag:
  inline:
    - ocp-docs
  tool:
    - ocp-docs

The BYOK vector store providers and registered resources are automatically generated at startup from the byok_rag entries above. Models and inference providers must be configured separately in your run.yaml.
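One easy mistake in a configuration like the one above is listing a name under rag.inline or rag.tool that does not match any rag_id. The sketch below checks for that on a config already loaded into a dict. It assumes that entries under rag refer either to a byok_rag rag_id or to the special name okp, as the examples in this guide suggest; `unknown_rag_references` is a hypothetical helper.

```python
def unknown_rag_references(config):
    """Return names listed under rag.inline / rag.tool that match no known source."""
    known = {entry["rag_id"] for entry in config.get("byok_rag", [])}
    known.add("okp")  # OKP is enabled by name and has no byok_rag entry.
    rag = config.get("rag", {})
    referenced = list(rag.get("inline", [])) + list(rag.get("tool", []))
    return sorted({name for name in referenced if name not in known})


example_config = {
    "byok_rag": [{"rag_id": "ocp-docs", "rag_type": "inline::faiss"}],
    "rag": {"inline": ["ocp-docs"], "tool": ["ocp-docs"]},
}
```

An empty result for `example_config` indicates that every referenced source is defined.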

System Prompt Guidance for RAG (as a tool)

When using Tool RAG, the knowledge_search tool must be explicitly referenced in your system prompt. Without clear instructions, models may use the tool inconsistently.

Tool-Aware sample instruction:

You are a helpful assistant with access to a 'knowledge_search' tool. When users ask questions, ALWAYS use the knowledge_search tool first to find accurate information from the documentation before answering.

RAG annotations

The top-level vector_stores block in run.yaml may include annotation_prompt_params to control whether extra RAG annotation instructions are injected into the model prompt (for example, citation-style markers). The default configuration sets enable_annotations: false under that block to avoid unwanted annotations.
