Merge pull request #1871 from max-svistunov/lcore-2093-2094-byok-pdf-e2e-and-docs

tisnik · web-flow · commit 26060372e1c2 · 2026-06-09T09:38:35.000+02:00
LCORE-2093: e2e test for PDF-built BYOK vector store + BYOK guide docs (LCORE-2094)
diff --git a/docs/byok_guide.md b/docs/byok_guide.md
@@ -105,9 +105,9 @@ Before implementing BYOK, ensure you have:
 - **LLM Provider**: OpenAI, vLLM, or other supported inference provider
 
 ### Knowledge Sources
-- **Directly supported**: Markdown (.md) and plain text (.txt) files
-- **Requires conversion**: PDFs, AsciiDoc, HTML, and other formats must be converted to markdown or TXT
-- Documentation, manuals, FAQs, knowledge bases (after format conversion)
+- **Directly supported**: Markdown (`.md`), plain text (`.txt`), PDF (`.pdf`), and HTML (`.html`/`.htm`) files. PDF and HTML are converted to Markdown automatically by [`rag-content`](https://github.com/lightspeed-core/rag-content) (via docling) — no manual pre-conversion step is needed. See the rag-content README's [Supported Input Formats](https://github.com/lightspeed-core/rag-content#supported-input-formats) section.
+- **Requires conversion**: AsciiDoc and other formats must be converted to Markdown or plain text first.
+- Documentation, manuals, FAQs, knowledge bases
 
 ---
 
@@ -116,11 +116,10 @@ Before implementing BYOK, ensure you have:
 ### Step 1: Prepare Your Knowledge Sources
 
 1. **Collect your documents**: Gather all knowledge sources you want to include
-2. **Convert formats**: Convert non-supported formats to markdown (.md) or plain text (.txt)
-   - **PDF conversion**: Use tools like [docling](https://github.com/DS4SD/docling) to convert PDFs to markdown
-   - **Adoc conversion**: Use [custom scripts](https://github.com/openshift/lightspeed-rag-content/blob/main/scripts/asciidoctor-text/convert-it-all.py) to convert AsciiDoc to plain text
-3. **Organize content**: Structure your converted documents for optimal indexing
-4. **Format validation**: Ensure all documents are in supported formats (.md or .txt)
+2. **Markdown, text, PDF, and HTML ingest directly**: Place `.md`, `.txt`, `.pdf`, and `.html`/`.htm` files in your input directory and pass them to `rag-content`; for PDF and HTML, select the matching document type (`-t pdf` or `-t html`). `rag-content` converts PDF and HTML to Markdown via docling, so no manual pre-conversion step is required.
+   - **PDF note**: Scanned / image-only PDFs are not supported because OCR is disabled: they are indexed with no content, and `rag-content` logs a warning naming the file. Run such PDFs through a separate OCR step first.
+   - **AsciiDoc and other formats**: Convert to Markdown or plain text first — e.g. use [custom scripts](https://github.com/openshift/lightspeed-rag-content/blob/main/scripts/asciidoctor-text/convert-it-all.py) for AsciiDoc.
+3. **Organize content**: Structure your documents for optimal indexing
 
 ### Step 2: Create Vector Database
 
@@ -150,7 +149,7 @@ class CustomMetadataProcessor(MetadataProcessor):
 **Important Notes:**
 - Supported formats: 
   - Faiss Vector-IO
-- The same embedding model must be used for both creation and querying
+- **The embedding model (and its dimension) used to *build* the vector store must exactly match the one configured for querying** in the `byok_rag` section (see Step 3). A mismatch does not raise an error; queries silently return no results or irrelevant ones, because the query vector and the stored vectors are not comparable. The default is `sentence-transformers/all-mpnet-base-v2` (dimension `768`).
 
 ### Step 3: Configure Embedding Model
 
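Review note: the embedding-match rule added to the Important Notes above fails silently, so a pre-flight check can be useful. The following is a minimal sketch only; `EmbeddingSpec` and `embedding_mismatches` are hypothetical names and are not part of lightspeed-stack or `rag-content`. It assumes you have recorded the model and dimension used at build time and can read the values configured under `byok_rag`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Embedding settings, either recorded at build time or configured for querying."""

    model: str
    dimension: int


def embedding_mismatches(build: EmbeddingSpec, query: EmbeddingSpec) -> list[str]:
    """Return human-readable mismatches; an empty list means the specs agree."""
    problems = []
    if build.model != query.model:
        problems.append(
            f"model: built with {build.model!r}, querying with {query.model!r}"
        )
    if build.dimension != query.dimension:
        problems.append(
            f"dimension: built with {build.dimension}, querying with {query.dimension}"
        )
    return problems


# Values taken from the default stated in the guide.
built = EmbeddingSpec("sentence-transformers/all-mpnet-base-v2", 768)
configured = EmbeddingSpec("sentence-transformers/all-mpnet-base-v2", 768)
```

Calling `embedding_mismatches(built, configured)` before starting the service turns the silent failure mode into an explicit list of differences that can be logged or asserted on.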
diff --git a/tests/e2e/configuration/library-mode/lightspeed-stack-byok-pdf.yaml b/tests/e2e/configuration/library-mode/lightspeed-stack-byok-pdf.yaml
@@ -0,0 +1,45 @@
+name: Lightspeed Core Service (LCS)
+service:
+  host: 0.0.0.0
+  port: 8080
+  auth_enabled: false
+  workers: 1
+  color_log: true
+  access_log: true
+llama_stack:
+  use_as_library_client: true
+  library_client_config_path: run.yaml
+user_data_collection:
+  feedback_enabled: true
+  feedback_storage: "/tmp/data/feedback"
+  transcripts_enabled: true
+  transcripts_storage: "/tmp/data/transcripts"
+
+conversation_cache:
+  type: "sqlite"
+  sqlite:
+    db_path: "/tmp/data/conversation-cache.db"
+
+authentication:
+  module: "noop"
+inference:
+  default_provider: openai
+  default_model: gpt-4o-mini
+
+# BYOK store built from a PDF by rag-content's `pdf` module (LCORE-2091).
+# The vector_db_id is hardcoded to the id baked into the committed fixture
+# store (tests/e2e/rag/pdf_kv_store.db) so this feature is self-contained and
+# needs no externally provisioned vector-store id. See tests/e2e/rag/README.md
+# for how the fixture was produced.
+byok_rag:
+  - rag_id: pdf-field-notes
+    rag_type: inline::faiss
+    embedding_model: sentence-transformers/all-mpnet-base-v2
+    embedding_dimension: 768
+    vector_db_id: vs_4a27375c-b8da-4134-96fc-b8198d111015
+    db_path: ${env.PDF_KV_RAG_PATH:=~/.llama/storage/rag/pdf_kv_store.db}
+    score_multiplier: 1.0
+
+rag:
+  inline:
+    - pdf-field-notes
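Review note: the `db_path` value in the configuration above uses the `${env.NAME:=default}` form, which falls back to the default when the variable is not provided. The sketch below approximates that lookup for illustration; it is not the resolver used by llama-stack or lightspeed-stack. `resolve_env_refs` is a hypothetical helper, it treats only an unset variable as missing (the real behaviour for empty values may differ), and it does not expand `~`.

```python
import os
import re

# Matches ${env.NAME} and ${env.NAME:=default}; any other operators are ignored here.
_ENV_REF = re.compile(r"\$\{env\.([A-Za-z_][A-Za-z0-9_]*)(?::=([^}]*))?\}")


def resolve_env_refs(value: str, environ=None) -> str:
    """Replace ${env.NAME:=default} references using environ (os.environ by default)."""
    env = os.environ if environ is None else environ

    def _substitute(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        if name in env:
            return env[name]
        if default is not None:
            return default
        raise KeyError(f"environment variable {name} is not set and has no default")

    return _ENV_REF.sub(_substitute, value)
```

With `PDF_KV_RAG_PATH` unset, the `db_path` string from this config resolves to the default path under `~/.llama/storage/rag/`; setting the variable points the test at a different copy of the fixture store.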
diff --git a/tests/e2e/features/byok_pdf.feature b/tests/e2e/features/byok_pdf.feature
@@ -0,0 +1,52 @@
+@e2e_group_3 @skip-in-server-mode
+Feature: BYOK PDF support tests
+
+  # Validates that a vector store built from a PDF by rag-content's `pdf`
+  # module (LCORE-2091) is consumed correctly by lightspeed-stack: the BYOK
+  # source is registered and a query retrieves content that exists only in the
+  # source PDF. The fixture store (tests/e2e/rag/pdf_kv_store.db) holds a single
+  # deliberately fabricated fact, so a correct answer can only come from the
+  # store, not from the LLM's own knowledge.
+
+  Background:
+    Given The service is started locally
+      And The system is in default state
+      And I set the Authorization header to Bearer eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6Ikpva
+      And REST API service prefix is /v1
+      And the Lightspeed stack configuration directory is "tests/e2e/configuration"
+      And The service uses the lightspeed-stack-byok-pdf.yaml configuration
+      And The service is restarted
+
+  Scenario: PDF-built inline RAG source is registered
+    When I access REST API endpoint rags using HTTP GET method
+    Then The status code of the response is 200
+     And the body of the response has the following structure
+    """
+    {
+      "rags": [
+        "pdf-field-notes"
+      ]
+    }
+    """
+
+  Scenario: Query retrieves content sourced from the PDF
+    When I use "query" to ask question with authorization header
+    """
+    {"query": "According to the field notes, what is the name of the mascot of Red Hat Lightspeed?", "system_prompt": "You are an assistant. Answer only from the provided context. Write only lowercase letters", "model": "{MODEL}", "provider": "{PROVIDER}"}
+    """
+    Then The status code of the response is 200
+     And The response contains following fragments
+         | Fragments in LLM response |
+         | zephyr                    |
+     And The response contains non-empty rag_chunks
+
+  Scenario: Streaming query retrieves content sourced from the PDF
+    When I use "streaming_query" to ask question with authorization header
+    """
+    {"query": "According to the field notes, what is the name of the mascot of Red Hat Lightspeed?", "system_prompt": "You are an assistant. Answer only from the provided context. Write only lowercase letters", "model": "{MODEL}", "provider": "{PROVIDER}"}
+    """
+    Then The status code of the response is 200
+     And I wait for the response to be completed
+     And The streamed response contains following fragments
+         | Fragments in LLM response |
+         | zephyr                    |
diff --git a/tests/e2e/features/environment.py b/tests/e2e/features/environment.py
@@ -196,10 +196,10 @@ def before_scenario(context: Context, scenario: Scenario) -> None:
     resetting per-scenario Lightspeed override tracking and skip-restart flags.
 
     Skips the scenario if it has the `skip` tag, if it has the `local` tag
-    while the test run is not in local mode, or if it has
-    `skip-in-library-mode` when running in library mode. Scenario-specific
-    Lightspeed YAML is applied in the feature files (``The service uses the
-    ... configuration`` steps).
+    while the test run is not in local mode, if it has `skip-in-library-mode`
+    when running in library mode, or if it has `skip-in-server-mode` when running
+    in server mode. Scenario-specific Lightspeed YAML is applied in the feature
+    files (``The service uses the ... configuration`` steps).
     """
     if "skip" in scenario.effective_tags:
         scenario.skip("Marked with @skip")
@@ -213,6 +213,17 @@ def before_scenario(context: Context, scenario: Scenario) -> None:
         scenario.skip("Skipped in library mode (no separate llama-stack container)")
         return
 
+    # In server mode, skip scenarios that rely on a feature-specific BYOK store.
+    # Only library mode re-enriches the (in-process) llama-stack with the
+    # active config's byok_rag on restart; in server mode the external
+    # llama-stack keeps its startup config, so the store would not be loaded.
+    if not context.is_library_mode and "skip-in-server-mode" in scenario.effective_tags:
+        scenario.skip(
+            "Skipped in server mode (feature-specific BYOK store is not loaded "
+            "into the external llama-stack)"
+        )
+        return
+
     # Skip scenarios that depend on services not deployed in Prow/OpenShift
     # (e.g. mock-tls-inference, proxy sidecars only available in Docker Compose)
     if is_prow_environment() and "skip-in-prow" in scenario.effective_tags:
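Review note: the mode-dependent skip rules in `before_scenario` are easier to unit-test when expressed as a pure function. The sketch below is a hypothetical refactoring, not code from this PR; `skip_reason` is an invented name, and it covers only the `skip`, `skip-in-library-mode`, and `skip-in-server-mode` tags (the `local` and `skip-in-prow` checks are omitted).

```python
from typing import Optional


def skip_reason(tags: set[str], is_library_mode: bool) -> Optional[str]:
    """Return why a scenario should be skipped, or None if it should run."""
    if "skip" in tags:
        return "Marked with @skip"
    if is_library_mode and "skip-in-library-mode" in tags:
        return "Skipped in library mode (no separate llama-stack container)"
    if not is_library_mode and "skip-in-server-mode" in tags:
        return (
            "Skipped in server mode (feature-specific BYOK store is not loaded "
            "into the external llama-stack)"
        )
    return None
```

A hook could then call `skip_reason(set(scenario.effective_tags), context.is_library_mode)` and pass any non-`None` result to `scenario.skip(...)`, which keeps the tag logic testable without a behave `Context`.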
diff --git a/tests/e2e/rag/README.md b/tests/e2e/rag/README.md
@@ -1,2 +1,45 @@
-# List of source files stored in `tests/e2e/rag` directory
+# Vector stores in `tests/e2e/rag`
 
+This directory holds committed BYOK vector stores used by the e2e suite.
+
+## `kv_store.db`
+
+Faiss BYOK store used by `faiss.feature` and `inline_rag.feature` (the
+`e2e-test-docs` source). Consumed via the `FAISS_VECTOR_STORE_ID` /
+`KV_RAG_PATH` environment variables.
+
+## `pdf_kv_store.db` (+ `sources/lightspeed-field-notes.pdf`)
+
+Faiss BYOK store used by `byok_pdf.feature` (the `pdf-field-notes` source).
+It is **built from a PDF** by `rag-content`'s `pdf` module (LCORE-2091) to prove
+that a PDF-sourced vector store is consumed correctly by lightspeed-stack.
+
+- Source PDF: `sources/lightspeed-field-notes.pdf` — a tiny document containing a
+  single deliberately fabricated fact (a "purple penguin named Zephyr"), so a
+  correct query answer can only come from the store, not the LLM's knowledge.
+- Embedding model: `sentence-transformers/all-mpnet-base-v2` (dimension 768).
+- Baked-in `vector_db_id`: `vs_4a27375c-b8da-4134-96fc-b8198d111015`
+  (hardcoded in `lightspeed-stack-byok-pdf.yaml`, so the feature is
+  self-contained and needs no externally provisioned store id).
+
+### Reproduce
+
+From a `rag-content` checkout with PDF support (LCORE-2091):
+
+```bash
+python scripts/generate_embeddings.py \
+  -f <dir containing the PDF> \
+  -o <out dir> \
+  -i pdf-field-notes \
+  -m sentence-transformers/all-mpnet-base-v2 \
+  -d <embeddings model dir> \
+  -s llamastack-faiss \
+  -t pdf
+# -> <out dir>/faiss_store.db   (copied here as pdf_kv_store.db)
+```
+
+> Note: docling (used by the PDF reader) loads its own models from the Hugging
+> Face cache, but `DocumentProcessor` forces `HF_HOME` to the embeddings-model
+> dir. Until that is fixed (tracked separately), make docling's models reachable
+> by symlinking a populated `hub` into the embeddings-model dir
+> (`ln -s ~/.cache/huggingface/hub <embeddings model dir>/hub`).
diff --git a/tests/e2e/rag/pdf_kv_store.db b/tests/e2e/rag/pdf_kv_store.db
diff --git a/tests/e2e/rag/sources/lightspeed-field-notes.pdf b/tests/e2e/rag/sources/lightspeed-field-notes.pdf
diff --git a/tests/e2e/test_list.txt b/tests/e2e/test_list.txt
@@ -10,6 +10,7 @@ features/conversations.feature
 features/prompts.feature
 features/faiss.feature
 features/inline_rag.feature
+features/byok_pdf.feature
 features/feedback.feature
 features/query.feature
 features/responses.feature