VisualDocQA Kit for local GPU document RAG.
Vision-first document RAG with ColQwen2, Qwen2.5-VL, Qdrant, and FastAPI for PDF, DOCX, and image question answering and field extraction.
VisoRAG is a notebook-originated visual document QA and extraction baseline. It renders documents as page images, embeds the pages with ColQwen2, retrieves relevant pages through an in-memory Qdrant MaxSim collection, and answers with local Qwen2.5-VL through a FastAPI API or CLI. It is meant for developers, students, and researchers who want a source-readable visual RAG reference rather than a hosted multi-user product.
| Area | Status | What it means |
|---|---|---|
| Dev verification | Ready | .[dev] runs import, API, notebook, lint, and mocked tests without GPU model downloads. |
| Real inference | Requires CUDA | .[gpu], model downloads, and a CUDA GPU are required for ColQwen2 and Qwen2.5-VL. |
| Production deployment | Not hardened | Use a gateway, TLS, restricted CORS, strong token, private logs, and GPU capacity planning first. |
No custom web frontend is included in this clean release. The visible product surface is the FastAPI API, CLI, notebook, docs, and tests.
This is a real screenshot from the local FastAPI /docs route. It shows the actual public API surface.
This is generated conceptual artwork for repository sharing. It is not a product screenshot and does not show a real query result.
Traditional document RAG often starts by extracting text, which can lose layout, tables, charts, figures, and scanned-page context. VisoRAG keeps the document as images first: pages are rendered visually, embedded as page images, retrieved by visual similarity, and passed to a local vision-language model for answering or field extraction.
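To make "retrieved by visual similarity" concrete, here is a minimal pure-Python sketch of MaxSim (late-interaction) scoring, the comparison that the multivector Qdrant collection is described as using. The function names `dot`, `maxsim`, and `top_pages` are illustrative and not taken from the package, and the 2-D vectors are toy values standing in for real ColQwen2 embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """For each query vector take its best match among the page vectors,
    then sum those maxima."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def top_pages(query_vecs, pages, k: int = 5) -> List[int]:
    """Return the indices of the k highest-scoring pages, best first."""
    scored = [(maxsim(query_vecs, vecs), idx) for idx, vecs in enumerate(pages)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-D "embeddings" (hypothetical values for illustration only).
query = [[1.0, 0.0], [0.0, 1.0]]
pages = [
    [[1.0, 0.0], [0.5, 0.5]],  # page 0 scores 1.0 + 0.5 = 1.5
    [[0.0, 1.0], [1.0, 0.0]],  # page 1 scores 1.0 + 1.0 = 2.0
    [[0.2, 0.1]],              # page 2 scores 0.2 + 0.1 = 0.3
]
ranking = top_pages(query, pages, k=2)  # [1, 0]
```

Each page keeps many vectors instead of one pooled vector, which is why a query about a single table cell can still match a page that is mostly unrelated text.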
Good fits:
- Exploring visual document RAG architecture.
- Building a local GPU baseline for PDF/image QA.
- Testing field extraction response contracts.
- Learning how to package a notebook workflow as a FastAPI/CLI project.
Not good fits yet:
- CPU-only inference.
- Hosted multi-user SaaS.
- Production document ingestion without additional security controls.
- Persistent vector search over a long-lived document corpus.
The authoritative implementation is VisoRAG_v8_Final (1).ipynb. This clean repository was built from that notebook only. Older sibling files from the original workspace are intentionally excluded because they were stale derived artifacts.
The Python package in src/visorag/ is the contributor-friendly extraction of the notebook behavior. It must preserve the notebook contracts unless a documented migration updates both the notebook and tests.
- Vision-first document QA without OCR-first text extraction.
- PDF, DOCX, PNG, JPG, and JPEG input support in the real GPU runtime.
- ColQwen2 page embeddings and Qdrant in-memory multivector MaxSim retrieval.
- Local Qwen2.5-VL answer generation with deterministic generation settings.
- Flat JSON responses for extraction mode.
- FastAPI app factory and CLI for local serving and testing.
- CI-friendly tests that mock heavy GPU/model work.
| Install | Use it for | Does not include |
|---|---|---|
| python -m pip install -e ".[dev]" | Static checks, notebook contract, API tests, mocked pipeline tests | ColQwen2/Qwen2.5-VL runtime validation |
| python -m pip install -e ".[gpu]" | Real visual retrieval and generation on CUDA | Production auth, hosting, monitoring |
| DOCX support | Requires LibreOffice soffice on PATH | Not needed for PDF/image input |
Clone the repository and run the lightweight verification path:
git clone https://github.com/RossDmello2/visualdocqa-kit.git
cd visualdocqa-kit
python -m pip install -e ".[dev]"
python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m pytest
What this proves: the notebook still matches the expected contract, the package imports without loading GPU models, and API/pipeline behavior passes mocked tests. It does not prove real model inference.
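As an illustration of what "imports without loading GPU models" can mean in practice, the sketch below imports a module and reports whether any heavy dependency was pulled in as a side effect. It is not the content of scripts/smoke_import.py. The helper name and the list of heavy modules are assumptions, and `json` stands in for the real package so the sketch runs anywhere.

```python
import importlib
import sys

# Assumption: these are the heavy modules a GPU run would pull in.
HEAVY_MODULES = ("torch", "transformers", "colpali_engine")


def heavy_modules_loaded_by(module_name: str, heavy=HEAVY_MODULES) -> list:
    """Import module_name and report which heavy modules it newly loaded."""
    before = set(sys.modules)
    importlib.import_module(module_name)
    newly_loaded = set(sys.modules) - before
    return sorted(
        name
        for name in heavy
        if any(mod == name or mod.startswith(name + ".") for mod in newly_loaded)
    )


# "json" is a stand-in; a real check would pass the package name instead.
leaked = heavy_modules_loaded_by("json")
```

A check like this is only meaningful in a fresh interpreter, because modules that were already imported before the call do not count as newly loaded.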
For real GPU inference, install the GPU extra in a CUDA environment:
python -m pip install -e ".[gpu]"
The notebook was designed around a CUDA GPU runtime such as Google Colab T4. The local Qwen2.5-VL generation path is not CPU-ready. No public sample document is bundled; use only synthetic or public documents when sharing screenshots or issues.
Start the local API:
export VISORAG_API_TOKEN="change-me"
python -m visorag serve --host 127.0.0.1 --port 8000
PowerShell:
$env:VISORAG_API_TOKEN = "change-me"
python -m visorag serve --host 127.0.0.1 --port 8000
change-me and the built-in test token fallback are local-development placeholders, not deployment values. Always set a strong VISORAG_API_TOKEN before serving real documents.
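One way to enforce the token advice is a startup guard that refuses placeholder or short values. This is a sketch, not package behavior: `require_strong_token`, the placeholder set, and the 32-character minimum are illustrative assumptions. Only the VISORAG_API_TOKEN variable name and the change-me placeholder come from this README.

```python
import os

# Assumption: the placeholder set and minimum length are illustrative choices.
PLACEHOLDER_TOKENS = {"change-me"}
MIN_TOKEN_LENGTH = 32


def require_strong_token(env=None) -> str:
    """Return the configured token, or raise if it is missing or weak."""
    env = os.environ if env is None else env
    token = env.get("VISORAG_API_TOKEN", "")
    if not token:
        raise RuntimeError("VISORAG_API_TOKEN is not set")
    if token in PLACEHOLDER_TOKENS:
        raise RuntimeError("VISORAG_API_TOKEN is a development placeholder")
    if len(token) < MIN_TOKEN_LENGTH:
        raise RuntimeError(
            f"VISORAG_API_TOKEN is shorter than {MIN_TOKEN_LENGTH} characters"
        )
    return token
```

A suitable value can be generated with the standard library, for example `python -c "import secrets; print(secrets.token_urlsafe(32))"`.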
Health check:
python -m visorag health --url http://127.0.0.1:8000
Submit a document:
python -m visorag query \
--url http://127.0.0.1:8000 \
--token "$VISORAG_API_TOKEN" \
--file /path/to/document.pdf \
--query "Extract total due and invoice number" \
--query-type extraction \
--top-k 5
FastAPI routes:
- GET / returns service metadata.
- GET /health returns model-load flags and device info.
- POST /query accepts multipart file plus query, query_type, document_type, and top_k.
See docs/API_REFERENCE.md for status codes, request fields, and both API-layer and pipeline error shapes.
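For clients that do not use the CLI, the sketch below builds a POST /query request with only the standard library. It assumes the token is sent as an `Authorization: Bearer ...` header and that the upload part is named `file`. It omits document_type because the accepted values are not listed here. The helper names, the example token, and the generic file content type are illustrative, so check docs/API_REFERENCE.md before relying on any of them.

```python
import urllib.request
import uuid


def build_multipart(fields: dict, file_name: str, file_bytes: bytes):
    """Encode text fields plus one part named "file" as multipart/form-data."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"'
            f"\r\n\r\n{value}\r\n".encode()
        )
    parts.append(
        (
            f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
            f'filename="{file_name}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_query_request(base_url, token, file_name, file_bytes, query,
                        query_type="extraction", top_k=5):
    """Build (but do not send) a POST /query request."""
    fields = {"query": query, "query_type": query_type, "top_k": top_k}
    body, content_type = build_multipart(fields, file_name, file_bytes)
    return urllib.request.Request(
        base_url.rstrip("/") + "/query",
        data=body,
        method="POST",
        # Assumption: bearer auth uses the standard Authorization header.
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
    )


request = build_query_request(
    "http://127.0.0.1:8000",
    "example-token",            # hypothetical token value
    "invoice.pdf",              # hypothetical file name
    b"%PDF-1.4 placeholder",    # stand-in bytes, not a valid PDF
    "Extract total due and invoice number",
)
# To send it against a running server: urllib.request.urlopen(request)
```

Building the request separately from sending it keeps the example testable without a GPU server running.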
Extraction mode returns flat JSON:
{
"total_due": 6610.95,
"invoice_number": "BPXIN-00550"
}
Factual and summary modes return:
{
"answer": "The total due is 6610.95."
}
Pipeline errors include request_id; API validation/auth errors may only include error and detail because the pipeline has not started yet.
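A client can tell these shapes apart by inspecting the keys of the decoded JSON. The sketch below is one possible approach and assumes that error payloads always carry an `error` key. The function name and the returned labels are illustrative and not part of the API.

```python
def classify_response(payload: dict) -> str:
    """Label a decoded JSON response by its shape."""
    if "error" in payload:
        # Pipeline errors carry request_id; API-layer errors may not.
        return "pipeline_error" if "request_id" in payload else "api_error"
    if set(payload) == {"answer"}:
        return "answer"
    # Anything else is treated as flat extraction fields.
    return "extraction"


examples = [
    {"total_due": 6610.95, "invoice_number": "BPXIN-00550"},
    {"answer": "The total due is 6610.95."},
    {"error": "unauthorized", "detail": "hypothetical detail text"},
    {"error": "failed", "detail": "hypothetical", "request_id": "hypothetical-id"},
]
labels = [classify_response(p) for p in examples]
```

Shape-based detection is ambiguous if an extraction query asks for a field literally named answer, so a client that knows which query_type it sent should prefer that over guessing.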
| Layer | Technology |
|---|---|
| Language | Python 3.10+ |
| API | FastAPI, Uvicorn |
| CLI | argparse, HTTPX |
| Visual retrieval | ColQwen2 via colpali-engine |
| Vector store | Qdrant in-memory multivector collection |
| Answer generation | Local Qwen/Qwen2.5-VL-3B-Instruct |
| Document rendering | PyMuPDF, Pillow, LibreOffice for DOCX conversion |
| Tests and checks | Pytest, Ruff, notebook JSON/AST contract checks |
flowchart LR
A["Upload PDF, DOCX, or image"] --> B["Validate request"]
B --> C["Render pages as images"]
C --> D["Embed pages with ColQwen2"]
D --> E["Index in per-request in-memory Qdrant"]
F["User query"] --> G["Embed query with ColQwen2"]
G --> H["Retrieve top visual pages"]
E --> H
H --> I["Generate answer with local Qwen2.5-VL"]
I --> J["Flat JSON extraction or answer response"]
Runtime constraints:
- Requests are serialized through a process-level query lock.
- Uploads default to 25 MB.
- Rendered documents default to 200 pages maximum.
- Qdrant state is per request and in memory.
- Uploaded files and rendered pages are temporary by default.
Longer architecture notes live in docs/ARCHITECTURE.md.
| Need | Start here |
|---|---|
| First setup | docs/GETTING_STARTED.md |
| API clients | docs/API_REFERENCE.md |
| Architecture | docs/ARCHITECTURE.md |
| Deployment | docs/DEPLOYMENT.md |
| Troubleshooting | docs/TROUBLESHOOTING.md |
| Contribution model | CONTRIBUTING.md and docs/feature-extension.md |
| Naming and SEO rationale | docs/NAMING_SEO_STRATEGY.md |
.
|-- VisoRAG_v8_Final (1).ipynb # authoritative notebook provenance
|-- src/visorag/ # import-safe package extraction
|-- tests/ # mocked unit/API/notebook contract tests
|-- scripts/ # static notebook and import smoke checks
|-- docs/ # architecture, API, deployment, and operations notes
|-- docs/assets/ # README visuals and captured API screenshots
`-- .github/ # CI, issue templates, PR template, Dependabot
python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m ruff format --check .
python -m ruff check .
python -m pytest
Real model inference is intentionally outside the default CI path because it requires CUDA, model downloads, and enough GPU memory. CI proves package/import/API contracts, not production GPU throughput.
VisoRAG is easiest to run as a long-running Python service on a GPU machine or notebook runtime. Do not expose the FastAPI server directly to the internet without a gateway, TLS, restricted CORS, a strong bearer token, request-log controls, and dependency/model-license review. See docs/DEPLOYMENT.md.
Do not upload private documents to public issues, logs, or CI artifacts. The local API uses bearer-token auth but enables permissive CORS for notebook compatibility. The runtime may log filenames and query snippets, so disable or restrict logs before processing sensitive documents. See SECURITY.md and docs/security-model.md.
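One low-effort way to restrict logs is to raise the level of the relevant logger before serving. The sketch below assumes the runtime logs through Python's standard logging module under a logger named `visorag`, and that filenames and query snippets are emitted below WARNING. Both are assumptions to verify against docs/security-model.md.

```python
import logging


def quiet_logger(name: str = "visorag", level: int = logging.WARNING) -> logging.Logger:
    """Raise a logger's threshold so lower-severity records are dropped."""
    logger = logging.getLogger(name)  # the logger name is an assumption
    logger.setLevel(level)
    return logger


logger = quiet_logger()
```

Child loggers such as `visorag.api` inherit this threshold unless they set their own level, so one call at startup may be enough.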
Use SUPPORT.md for safe support requests and SECURITY.md for vulnerability reporting.
Read CONTRIBUTING.md and docs/feature-extension.md before opening a pull request. Behavior changes must preserve the notebook contract or document the migration.
Use CITATION.cff for software citation metadata. VisoRAG builds on ColQwen2, Qwen2.5-VL, Qdrant, PyMuPDF, FastAPI, and the Python scientific/ML ecosystem; retain upstream license notices when redistributing.
See docs/THIRD_PARTY_NOTICES.md for dependency notice guidance.
READY WITH GAPS: the source package, docs, CI checks, notebook contract, import smoke, mocked tests, local FastAPI health endpoint, API docs rendering, and approved GitHub rename to visualdocqa-kit have been verified. Real GPU inference with ColQwen2 and Qwen2.5-VL is documented but not verified in CI, and internet-facing deployment requires additional hardening.
MIT. See LICENSE.

