Commit 2606037

Merge pull request #1871 from max-svistunov/lcore-2093-2094-byok-pdf-e2e-and-docs
LCORE-2093: e2e test for PDF-built BYOK vector store + BYOK guide docs (LCORE-2094)
2 parents aa29526 + 6879be3 commit 2606037

8 files changed

Lines changed: 165 additions & 14 deletions

docs/byok_guide.md

Lines changed: 8 additions & 9 deletions
@@ -105,9 +105,9 @@ Before implementing BYOK, ensure you have:
 - **LLM Provider**: OpenAI, vLLM, or other supported inference provider
 
 ### Knowledge Sources
-- **Directly supported**: Markdown (.md) and plain text (.txt) files
-- **Requires conversion**: PDFs, AsciiDoc, HTML, and other formats must be converted to markdown or TXT
-- Documentation, manuals, FAQs, knowledge bases (after format conversion)
+- **Directly supported**: Markdown (`.md`), plain text (`.txt`), PDF (`.pdf`), and HTML (`.html`/`.htm`) files. PDF and HTML are converted to Markdown automatically by [`rag-content`](https://github.com/lightspeed-core/rag-content) (via docling) — no manual pre-conversion step is needed. See the rag-content README's [Supported Input Formats](https://github.com/lightspeed-core/rag-content#supported-input-formats) section.
+- **Requires conversion**: AsciiDoc and other formats must be converted to Markdown or plain text first.
+- Documentation, manuals, FAQs, knowledge bases
 
 ---
 
@@ -116,11 +116,10 @@ Before implementing BYOK, ensure you have:
 ### Step 1: Prepare Your Knowledge Sources
 
 1. **Collect your documents**: Gather all knowledge sources you want to include
-2. **Convert formats**: Convert non-supported formats to markdown (.md) or plain text (.txt)
-   - **PDF conversion**: Use tools like [docling](https://github.com/DS4SD/docling) to convert PDFs to markdown
-   - **Adoc conversion**: Use [custom scripts](https://github.com/openshift/lightspeed-rag-content/blob/main/scripts/asciidoctor-text/convert-it-all.py) to convert AsciiDoc to plain text
-3. **Organize content**: Structure your converted documents for optimal indexing
-4. **Format validation**: Ensure all documents are in supported formats (.md or .txt)
+2. **Markdown, text, PDF, and HTML ingest directly**: Place `.md`, `.txt`, `.pdf`, and `.html` files in your input directory and pass them to `rag-content` with the matching document type (`-t pdf` or `-t html`). `rag-content` converts PDF and HTML to Markdown for you via docling — no manual pre-conversion step is required.
+   - **PDF note**: Scanned / image-only PDFs are out of scope (OCR is disabled); they index as empty and `rag-content` logs a warning naming the file. Run such PDFs through a separate OCR step first.
+   - **AsciiDoc and other formats**: Convert to Markdown or plain text first — e.g. use [custom scripts](https://github.com/openshift/lightspeed-rag-content/blob/main/scripts/asciidoctor-text/convert-it-all.py) for AsciiDoc.
+3. **Organize content**: Structure your documents for optimal indexing
 
 ### Step 2: Create Vector Database
 
@@ -150,7 +149,7 @@ class CustomMetadataProcessor(MetadataProcessor):
 **Important Notes:**
 - Supported formats:
   - Faiss Vector-IO
-- The same embedding model must be used for both creation and querying
+- **The embedding model (and its dimension) used to *build* the vector store must exactly match the one configured for querying** in the `byok_rag` section (see Step 3). A mismatch does not raise an error — it silently returns no or irrelevant results, because the query vector and the stored vectors are then incomparable. The default is `sentence-transformers/all-mpnet-base-v2` (dimension `768`).
 
 ### Step 3: Configure Embedding Model
 
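Review note on the embedding-model bullet above: since a mismatch fails silently, a pre-flight comparison of build-time and query-time settings may be worth having. The following is a minimal sketch only. `EmbeddingSpec` and `check_embedding_match` are hypothetical names for illustration and are not part of lightspeed-stack or rag-content; it also assumes the build-time model and dimension are recorded somewhere the caller can read them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Embedding settings for one side (build or query). Hypothetical helper type."""

    model: str
    dimension: int


def check_embedding_match(build: EmbeddingSpec, query: EmbeddingSpec) -> None:
    """Raise ValueError when build-time and query-time embedding settings differ."""
    problems = []
    if build.model != query.model:
        problems.append(
            f"model: built with {build.model!r}, querying with {query.model!r}"
        )
    if build.dimension != query.dimension:
        problems.append(
            f"dimension: built with {build.dimension}, querying with {query.dimension}"
        )
    if problems:
        raise ValueError("embedding mismatch (" + "; ".join(problems) + ")")


# Default named in the guide: all-mpnet-base-v2 with 768 dimensions.
DEFAULT_SPEC = EmbeddingSpec("sentence-transformers/all-mpnet-base-v2", 768)
```

Calling `check_embedding_match(DEFAULT_SPEC, query_spec)` before serving queries would turn the silent "no or irrelevant results" case into an explicit error.
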
Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+name: Lightspeed Core Service (LCS)
+service:
+  host: 0.0.0.0
+  port: 8080
+  auth_enabled: false
+  workers: 1
+  color_log: true
+  access_log: true
+llama_stack:
+  use_as_library_client: true
+  library_client_config_path: run.yaml
+user_data_collection:
+  feedback_enabled: true
+  feedback_storage: "/tmp/data/feedback"
+  transcripts_enabled: true
+  transcripts_storage: "/tmp/data/transcripts"
+
+conversation_cache:
+  type: "sqlite"
+  sqlite:
+    db_path: "/tmp/data/conversation-cache.db"
+
+authentication:
+  module: "noop"
+inference:
+  default_provider: openai
+  default_model: gpt-4o-mini
+
+# BYOK store built from a PDF by rag-content's `pdf` module (LCORE-2091).
+# The vector_db_id is hardcoded to the id baked into the committed fixture
+# store (tests/e2e/rag/pdf_kv_store.db) so this feature is self-contained and
+# needs no externally-provisioned vector-store id. See tests/e2e/rag/README.md
+# for how the fixture was produced.
+byok_rag:
+  - rag_id: pdf-field-notes
+    rag_type: inline::faiss
+    embedding_model: sentence-transformers/all-mpnet-base-v2
+    embedding_dimension: 768
+    vector_db_id: vs_4a27375c-b8da-4134-96fc-b8198d111015
+    db_path: ${env.PDF_KV_RAG_PATH:=~/.llama/storage/rag/pdf_kv_store.db}
+    score_multiplier: 1.0
+
+rag:
+  inline:
+    - pdf-field-notes
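
Review note on `db_path` in the configuration above: the `${env.NAME:=default}` placeholder reads as "use the environment variable if set, otherwise the default". The sketch below approximates that behaviour for illustration only. `resolve_env_placeholders` is a hypothetical helper, not the resolver the service actually uses, and the real one may differ in details such as how empty values or `~` are handled.

```python
import re
from typing import Mapping

# Matches ${env.NAME} and ${env.NAME:=default}.
_ENV_PLACEHOLDER = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_placeholders(text: str, env: Mapping[str, str]) -> str:
    """Replace each placeholder with the env value, falling back to its default."""

    def substitute(match: "re.Match[str]") -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_PLACEHOLDER.sub(substitute, text)
```

Passing `os.environ` as `env` resolves against the real environment; passing a plain dict keeps the helper easy to test. Note that this sketch leaves `~` unexpanded.
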
Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+@e2e_group_3 @skip-in-server-mode
+Feature: BYOK PDF support tests
+
+  # Validates that a vector store built from a PDF by rag-content's `pdf`
+  # module (LCORE-2091) is consumed correctly by lightspeed-stack: the BYOK
+  # source is registered and a query retrieves content that exists only in the
+  # source PDF. The fixture store (tests/e2e/rag/pdf_kv_store.db) holds a single
+  # deliberately-fabricated fact, so a correct answer can only come from the
+  # store, not from the LLM's own knowledge.
+
+  Background:
+    Given The service is started locally
+    And The system is in default state
+    And I set the Authorization header to Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ikpva
+    And REST API service prefix is /v1
+    And the Lightspeed stack configuration directory is "tests/e2e/configuration"
+    And The service uses the lightspeed-stack-byok-pdf.yaml configuration
+    And The service is restarted
+
+  Scenario: PDF-built inline RAG source is registered
+    When I access REST API endpoint rags using HTTP GET method
+    Then The status code of the response is 200
+    And the body of the response has the following structure
+    """
+    {
+      "rags": [
+        "pdf-field-notes"
+      ]
+    }
+    """
+
+  Scenario: Query retrieves content sourced from the PDF
+    When I use "query" to ask question with authorization header
+    """
+    {"query": "According to the field notes, what is the name of the mascot of Red Hat Lightspeed?", "system_prompt": "You are an assistant. Answer only from the provided context. Write only lowercase letters", "model": "{MODEL}", "provider": "{PROVIDER}"}
+    """
+    Then The status code of the response is 200
+    And The response contains following fragments
+      | Fragments in LLM response |
+      | zephyr                    |
+    And The response contains non-empty rag_chunks
+
+  Scenario: Streaming query retrieves content sourced from the PDF
+    When I use "streaming_query" to ask question with authorization header
+    """
+    {"query": "According to the field notes, what is the name of the mascot of Red Hat Lightspeed?", "system_prompt": "You are an assistant. Answer only from the provided context. Write only lowercase letters", "model": "{MODEL}", "provider": "{PROVIDER}"}
+    """
+    Then The status code of the response is 200
+    And I wait for the response to be completed
+    And The streamed response contains following fragments
+      | Fragments in LLM response |
+      | zephyr                    |
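
Review note on the fragment assertions above: the step implementation is not part of this commit, so the following is only a sketch of what such a check could look like. It assumes a plain, case-sensitive substring match, which would explain why the system prompt asks the model to write only lowercase letters. `missing_fragments` is a hypothetical name.

```python
from typing import Iterable, List


def missing_fragments(response_text: str, fragments: Iterable[str]) -> List[str]:
    """Return the expected fragments that do not occur in the response text."""
    return [fragment for fragment in fragments if fragment not in response_text]
```

An empty result means every expected fragment was found. Under this assumption a capitalized "Zephyr" would not satisfy the `zephyr` fragment.
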

tests/e2e/features/environment.py

Lines changed: 15 additions & 4 deletions
@@ -196,10 +196,10 @@ def before_scenario(context: Context, scenario: Scenario) -> None:
     resetting per-scenario Lightspeed override tracking and skip-restart flags.
 
     Skips the scenario if it has the `skip` tag, if it has the `local` tag
-    while the test run is not in local mode, or if it has
-    `skip-in-library-mode` when running in library mode. Scenario-specific
-    Lightspeed YAML is applied in the feature files (``The service uses the
-    ... configuration`` steps).
+    while the test run is not in local mode, if it has `skip-in-library-mode`
+    when running in library mode, or if it has `skip-in-server-mode` when running
+    in server mode. Scenario-specific Lightspeed YAML is applied in the feature
+    files (``The service uses the ... configuration`` steps).
     """
     if "skip" in scenario.effective_tags:
         scenario.skip("Marked with @skip")
@@ -213,6 +213,17 @@ def before_scenario(context: Context, scenario: Scenario) -> None:
         scenario.skip("Skipped in library mode (no separate llama-stack container)")
         return
 
+    # Skip scenarios that rely on a non-default BYOK store. Only library mode
+    # re-enriches the (in-process) llama-stack with the active config's byok_rag
+    # on restart; in server mode the external llama-stack keeps its startup
+    # config, so a feature-specific store would not be loaded.
+    if not context.is_library_mode and "skip-in-server-mode" in scenario.effective_tags:
+        scenario.skip(
+            "Skipped in server mode (feature-specific BYOK store is not loaded "
+            "into the external llama-stack)"
+        )
+        return
+
     # Skip scenarios that depend on services not deployed in Prow/OpenShift
     # (e.g. mock-tls-inference, proxy sidecars only available in Docker Compose)
     if is_prow_environment() and "skip-in-prow" in scenario.effective_tags:
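
Review note on the new skip branch above: the mode-dependent skip rules can be read as one pure decision function, which may be easier to unit-test than the hook itself. This is a sketch, not the code in `environment.py`. `skip_reason` is a hypothetical name, the `local` and Prow checks are left out, and the condition for the library-mode branch is inferred from the docstring because the diff shows only its skip message.

```python
from typing import Optional, Set


def skip_reason(tags: Set[str], is_library_mode: bool) -> Optional[str]:
    """Return the skip message for a scenario, or None if it should run."""
    if "skip" in tags:
        return "Marked with @skip"
    if is_library_mode and "skip-in-library-mode" in tags:
        return "Skipped in library mode (no separate llama-stack container)"
    if not is_library_mode and "skip-in-server-mode" in tags:
        return (
            "Skipped in server mode (feature-specific BYOK store is not loaded "
            "into the external llama-stack)"
        )
    return None
```

With the tags from `byok_pdf.feature`, `skip_reason({"e2e_group_3", "skip-in-server-mode"}, is_library_mode=False)` returns the server-mode message, while the same tags in library mode return `None`.
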

tests/e2e/rag/README.md

Lines changed: 44 additions & 1 deletion
@@ -1,2 +1,45 @@
-# List of source files stored in `tests/e2e/rag` directory
+# Vector stores in `tests/e2e/rag`
 
+This directory holds committed BYOK vector stores used by the e2e suite.
+
+## `kv_store.db`
+
+Faiss BYOK store used by `faiss.feature` and `inline_rag.feature` (the
+`e2e-test-docs` source). Consumed via the `FAISS_VECTOR_STORE_ID` /
+`KV_RAG_PATH` environment variables.
+
+## `pdf_kv_store.db` (+ `sources/lightspeed-field-notes.pdf`)
+
+Faiss BYOK store used by `byok_pdf.feature` (the `pdf-field-notes` source).
+It is **built from a PDF** by `rag-content`'s `pdf` module (LCORE-2091) to prove
+that a PDF-sourced vector store is consumed correctly by lightspeed-stack.
+
+- Source PDF: `sources/lightspeed-field-notes.pdf` — a tiny document containing a
+  single deliberately-fabricated fact (a "purple penguin named Zephyr"), so a
+  correct query answer can only come from the store, not the LLM's knowledge.
+- Embedding model: `sentence-transformers/all-mpnet-base-v2` (dimension 768).
+- Baked-in `vector_db_id`: `vs_4a27375c-b8da-4134-96fc-b8198d111015`
+  (hardcoded in `lightspeed-stack-byok-pdf.yaml`, so the feature is
+  self-contained and needs no externally-provisioned store id).
+
+### Reproduce
+
+From a `rag-content` checkout with PDF support (LCORE-2091):
+
+```bash
+python scripts/generate_embeddings.py \
+  -f <dir containing the PDF> \
+  -o <out dir> \
+  -i pdf-field-notes \
+  -m sentence-transformers/all-mpnet-base-v2 \
+  -d <embeddings model dir> \
+  -s llamastack-faiss \
+  -t pdf
+# -> <out dir>/faiss_store.db (copied here as pdf_kv_store.db)
+```
+
+> Note: docling (used by the PDF reader) loads its own models from the Hugging
+> Face cache, but `DocumentProcessor` forces `HF_HOME` to the embeddings-model
+> dir. Until that is fixed (tracked separately), make docling's models reachable
+> by symlinking a populated `hub` into the embeddings-model dir
+> (`ln -s ~/.cache/huggingface/hub <embeddings model dir>/hub`).
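
Review note on the symlink workaround in the README note above: for anyone scripting the fixture build, the `ln -s` step could be done from Python as well. The sketch below is an illustration under stated assumptions. `link_hf_hub` is a hypothetical helper, and both paths are supplied by the caller rather than discovered automatically.

```python
from pathlib import Path


def link_hf_hub(embeddings_dir: Path, hub_cache: Path) -> Path:
    """Create <embeddings_dir>/hub as a symlink to an existing Hugging Face hub cache."""
    link = embeddings_dir / "hub"
    if link.is_symlink() or link.exists():
        # Leave whatever is already there untouched so reruns are harmless.
        return link
    link.symlink_to(hub_cache, target_is_directory=True)
    return link
```

For example, `link_hf_hub(Path("<embeddings model dir>"), Path.home() / ".cache" / "huggingface" / "hub")` mirrors the shell command from the note, with the placeholder directory filled in by the caller.
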

tests/e2e/rag/pdf_kv_store.db

32 KB
Binary file not shown.
911 Bytes
Binary file not shown.

tests/e2e/test_list.txt

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ features/conversations.feature
 features/prompts.feature
 features/faiss.feature
 features/inline_rag.feature
+features/byok_pdf.feature
 features/feedback.feature
 features/query.feature
 features/responses.feature
