LCORE-2093: e2e test for PDF-built BYOK vector store + BYOK guide docs (LCORE-2094) #1871
rag-content now ingests PDF and HTML directly via docling (LCORE-2091), so the
guide should no longer tell users to pre-convert PDFs to Markdown.
- Knowledge Sources: move PDF and HTML to "Directly supported" and note they are
converted automatically by rag-content (docling); keep AsciiDoc under
"Requires conversion". Link the rag-content README's supported-formats section.
- Step 1: drop the docling-as-pre-conversion step; document that .md, .txt, .pdf,
and .html ingest directly with the matching document type, and add the
scanned/image-only PDF caveat (OCR disabled; indexes empty with a warning).
- Important Notes: strengthen the embedding-model requirement: the model and
dimension used to build a store must match the one configured for querying; a
mismatch silently returns no or irrelevant results rather than erroring.
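Because a mismatch fails silently, one way to make it loud is a small guard that compares the build-time and query-time embedding settings before any query runs. The sketch below is illustrative only: `EmbeddingSpec` and `check_embedding_match` are hypothetical names, not part of lightspeed-stack or rag-content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    """Hypothetical record of the embedding settings tied to a store."""

    model: str
    dimension: int


def check_embedding_match(build: EmbeddingSpec, query: EmbeddingSpec) -> None:
    """Raise ValueError if the query-time settings differ from the build-time ones."""
    problems = []
    if build.model != query.model:
        problems.append(f"model {query.model!r} != build model {build.model!r}")
    if build.dimension != query.dimension:
        problems.append(
            f"dimension {query.dimension} != build dimension {build.dimension}"
        )
    if problems:
        raise ValueError("embedding mismatch: " + "; ".join(problems))


# Example: a store built with all-mpnet-base-v2 (dimension 768) and queried
# with the same settings passes the check without raising.
built = EmbeddingSpec(model="all-mpnet-base-v2", dimension=768)
check_embedding_match(built, EmbeddingSpec(model="all-mpnet-base-v2", dimension=768))
```

Raising early turns the "silently returns nothing" failure mode into an explicit configuration error.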
…speed-stack
Verify end to end that a vector store built from a PDF by rag-content's pdf
module (LCORE-2091) is consumed correctly by lightspeed-stack: the BYOK source
registers and a query retrieves content that exists only in the source PDF.
Uses a committed fixture store so the test is self-contained and does not depend
on a cross-repo rag-content build or an externally-provisioned vector-store id.
- tests/e2e/features/byok_pdf.feature: mirrors inline_rag.feature; asserts the
pdf-field-notes source is registered and that query / streaming_query retrieve
the deliberately-fabricated "zephyr" fact (with non-empty rag_chunks), which
can only come from the store, not the LLM's own knowledge.
- tests/e2e/configuration/{library,server}-mode/lightspeed-stack-byok-pdf.yaml:
BYOK config with the vector_db_id and db_path hardcoded to the committed
fixture (no FAISS_VECTOR_STORE_ID CI variable required).
- tests/e2e/rag/pdf_kv_store.db: faiss BYOK store built from a PDF with the
all-mpnet-base-v2 embedding model (dimension 768).
- tests/e2e/rag/sources/lightspeed-field-notes.pdf: committed source PDF.
- tests/e2e/rag/README.md: document the stores and how to reproduce the PDF one.
- tests/e2e/test_list.txt: register features/byok_pdf.feature.
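The core assertion of the feature can be sketched in Python as follows. This is a minimal illustration, not the step definitions from the PR: the helper name `assert_answer_grounded` and the `response` field name are assumptions; only `rag_chunks` is named in the text above.

```python
def assert_answer_grounded(body: dict, expected_fact: str) -> None:
    """Check that a query response carries retrieved chunks and the expected fact.

    `body` is assumed to be the decoded JSON response of a query call.
    """
    chunks = body.get("rag_chunks") or []
    assert chunks, "expected non-empty rag_chunks (retrieval returned nothing)"

    answer = body.get("response", "")  # field name is an assumption
    assert expected_fact.lower() in answer.lower(), (
        f"expected {expected_fact!r} in the answer, got: {answer!r}"
    )


# Example with a hand-written response body (illustrative data only).
sample = {
    "response": "The mascot is a purple penguin named Zephyr.",
    "rag_chunks": [{"source": "lightspeed-field-notes.pdf"}],
}
assert_answer_grounded(sample, "zephyr")
```

Checking `rag_chunks` as well as the answer text distinguishes a genuine retrieval from an LLM that happens to guess the right word.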
CI surfaced that the query scenarios failed: the LLM hallucinated a mascot name
instead of returning the fact from the PDF, i.e. retrieval returned nothing.
Two causes:
- The committed fixture store is mounted into the e2e container at
~/.llama/storage/rag (per docker-compose / docker-compose-library), so the
byok_rag db_path must point there. It pointed at the repo-relative path
tests/e2e/rag/pdf_kv_store.db, which does not resolve at the container's
working directory, so the BYOK store loaded empty.
- The feature ran in server mode too, where the external llama-stack keeps its
startup config and is not re-enriched per feature; a feature-specific
(non-default) BYOK store is therefore not loadable there.
Fixes:
- Point db_path at ${env.PDF_KV_RAG_PATH:=~/.llama/storage/rag/pdf_kv_store.db}.
- Add a symmetric @skip-in-server-mode tag (handled in before_scenario, mirroring
@skip-in-library-mode) and apply it to byok_pdf.feature; remove the unusable
server-mode config.
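The symmetric tag handling could look roughly like the hook below. It is a sketch under assumptions, not the code from this PR: `context.is_library_mode` is a hypothetical attribute standing in for however the suite detects its deployment mode, while `scenario.effective_tags` and `scenario.skip()` are behave's own API.

```python
def before_scenario(context, scenario):
    """behave hook: skip scenarios tagged for the other deployment mode."""
    # effective_tags includes tags inherited from the feature, so tagging
    # the whole feature file is enough. Tag names carry no leading "@".
    tags = set(scenario.effective_tags)

    # Hypothetical flag; the real suite may derive the mode differently.
    library_mode = getattr(context, "is_library_mode", False)

    if "skip-in-server-mode" in tags and not library_mode:
        scenario.skip("Skipped in server mode")
    elif "skip-in-library-mode" in tags and library_mode:
        scenario.skip("Skipped in library mode")
```

In behave this function would live in `environment.py`; tagging `byok_pdf.feature` with `@skip-in-server-mode` then excludes every scenario in it when the suite runs against an external llama-stack.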
Description
Add an end-to-end test proving a PDF-built vector store is consumed correctly by
lightspeed-stack, and update the BYOK guide now that PDF (and HTML) are directly
supported by rag-content. Part of the BYOK PDF epic (LCORE-2090): LCORE-2093
(e2e) and LCORE-2094 (docs). Depends on rag-content PDF support (LCORE-2091).
LCORE-2093 (e2e):
- `tests/e2e/features/byok_pdf.feature` mirrors `inline_rag.feature`: it asserts
the BYOK source registers and that a query (and streaming query) retrieves a
fact that exists only in the source PDF, so a correct answer can only come from
the store and not the LLM's own knowledge.
- Uses a committed fixture store (`tests/e2e/rag/pdf_kv_store.db`), built from a
PDF by `rag-content`'s `pdf` module, with the `vector_db_id` / `db_path`
hardcoded in `lightspeed-stack-byok-pdf.yaml` (library + server mode). This
keeps the test self-contained: no cross-repo build and no
externally-provisioned store id.
- The source PDF is committed under `tests/e2e/rag/sources/` and the build is
documented in `tests/e2e/rag/README.md`.

LCORE-2094 (docs), `docs/byok_guide.md`:
- PDF and HTML moved to "Directly supported"; docling pre-conversion step
removed; scanned/image-only PDF caveat added; rag-content README linked.
- Embedding-model requirement strengthened: the model and dimension used to
build a store must match the one configured for querying, or retrieval silently
returns nothing.
Type of change
Tools used to create PR
Related Tickets & Documents
Checklist before requesting a review
Testing
How to test:
Run the new e2e feature (library mode) with the full stack, e.g.:
or target just the feature:
Expected: the `pdf-field-notes` BYOK source is registered, and a query for
the mascot of "Red Hat Lightspeed" returns the fabricated "zephyr" fact with
non-empty `rag_chunks`.
Results from this change:
`behave --dry-run` on the feature resolves every step to a definition (no
undefined steps); both config YAMLs validate.
`lightspeed-stack-byok-pdf.yaml` pointed at the committed PDF-built store. The
query "what is the mascot of Red Hat Lightspeed and on what day does it answer
questions?" returned: "The official mascot of Red Hat Lightspeed is a purple
penguin named Zephyr. Zephyr only answers questions on Tuesdays.", with
`rag_chunks` attributed to `lightspeed-field-notes.pdf`, content that exists
only in the PDF. This confirms the PDF-built store is retrieved end to end.
the pre-merge e2e job.
Note: no Python source changed in this PR (e2e feature/config/fixtures + docs),
so the Python linters are not applicable.