
RossDmello2/visualdocqa-kit

VisoRAG

VisualDocQA Kit for local GPU document RAG.

Vision-first document RAG with ColQwen2, Qwen2.5-VL, Qdrant, and FastAPI for PDF, DOCX, and image question answering and field extraction.


VisoRAG is a notebook-originated visual document QA and extraction baseline. It renders documents as page images, embeds the pages with ColQwen2, retrieves relevant pages through an in-memory Qdrant MaxSim collection, and answers with local Qwen2.5-VL through a FastAPI API or CLI. It is meant for developers, students, and researchers who want a source-readable visual RAG reference rather than a hosted multi-user product.

Read This First

| Area | Status | What it means |
| --- | --- | --- |
| Dev verification | Ready | .[dev] runs import, API, notebook, lint, and mocked tests without GPU model downloads. |
| Real inference | Requires CUDA | .[gpu], model downloads, and a CUDA GPU are required for ColQwen2 and Qwen2.5-VL. |
| Production deployment | Not hardened | Use a gateway, TLS, restricted CORS, strong token, private logs, and GPU capacity planning first. |

No custom web frontend is included in this clean release. The visible product surface is the FastAPI API, CLI, notebook, docs, and tests.

Preview

Real FastAPI Swagger documentation for VisoRAG

This is a real screenshot from the local FastAPI /docs route. It shows the actual public API surface.

Conceptual VisoRAG social preview showing document pages flowing through retrieval into a JSON answer

This is generated conceptual artwork for repository sharing. It is not a product screenshot and does not show a real query result.

What It Solves

Traditional document RAG often starts by extracting text, which can lose layout, tables, charts, figures, and scanned-page context. VisoRAG keeps the document as images first: pages are rendered visually, embedded as page images, retrieved by visual similarity, and passed to a local vision-language model for answering or field extraction.
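The retrieval step above relies on MaxSim (late-interaction) scoring between a multivector query and multivector pages. The following is a minimal pure-Python sketch of that scoring rule, not the project's code: the function names `dot`, `maxsim_score`, and `rank_pages` are hypothetical, and the tiny two-dimensional vectors stand in for real ColQwen2 embeddings. In the real runtime this comparison is delegated to Qdrant.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], page_vecs: Sequence[Vector]) -> float:
    """For each query vector, take its best match among the page vectors,
    then sum those maxima. page_vecs must be non-empty."""
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)


def rank_pages(
    query_vecs: Sequence[Vector],
    pages: Dict[int, Sequence[Vector]],
    top_k: int = 5,
) -> List[Tuple[int, float]]:
    """Score every page against the query and return the top_k (page, score) pairs."""
    scored = [(page_no, maxsim_score(query_vecs, vecs)) for page_no, vecs in pages.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Toy data: page 2 matches both query vectors, page 1 matches only the first.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {1: [[1.0, 0.0]], 2: [[1.0, 0.0], [0.0, 1.0]]}
ranking = rank_pages(query, pages, top_k=1)
```

Because every query vector picks its own best page vector, a page can score well when different regions of it answer different parts of the question.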

Good fits:

  • Exploring visual document RAG architecture.
  • Building a local GPU baseline for PDF/image QA.
  • Testing field extraction response contracts.
  • Learning how to package a notebook workflow as a FastAPI/CLI project.

Not good fits yet:

  • CPU-only inference.
  • Hosted multi-user SaaS.
  • Production document ingestion without additional security controls.
  • Persistent vector search over a long-lived document corpus.

Source of Truth

The authoritative implementation is VisoRAG_v8_Final (1).ipynb. This clean repository was built from that notebook only. Older sibling files from the original workspace are intentionally excluded because they were stale derived artifacts.

The Python package in src/visorag/ is the contributor-friendly extraction of the notebook behavior. It must preserve the notebook contracts unless a documented migration updates both the notebook and tests.

Features

  • Vision-first document QA without OCR-first text extraction.
  • PDF, DOCX, PNG, JPG, and JPEG input support in the real GPU runtime.
  • ColQwen2 page embeddings and Qdrant in-memory multivector MaxSim retrieval.
  • Local Qwen2.5-VL answer generation with deterministic generation settings.
  • Flat JSON responses for extraction mode.
  • FastAPI app factory and CLI for local serving and testing.
  • CI-friendly tests that mock heavy GPU/model work.

Install Modes

| Install | Use it for | Does not include |
| --- | --- | --- |
| python -m pip install -e ".[dev]" | Static checks, notebook contract, API tests, mocked pipeline tests | ColQwen2/Qwen2.5-VL runtime validation |
| python -m pip install -e ".[gpu]" | Real visual retrieval and generation on CUDA | Production auth, hosting, monitoring |
| DOCX support | Requires LibreOffice soffice on PATH | Not needed for PDF/image input |

Quickstart

Clone the repository and run the lightweight verification path:

git clone https://github.com/RossDmello2/visualdocqa-kit.git
cd visualdocqa-kit
python -m pip install -e ".[dev]"
python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m pytest

What this proves: the notebook still matches the expected contract, the package imports without loading GPU models, and API/pipeline behavior passes mocked tests. It does not prove real model inference.

For real GPU inference, install the GPU extra in a CUDA environment:

python -m pip install -e ".[gpu]"

The notebook was designed around a CUDA GPU runtime such as Google Colab T4. The local Qwen2.5-VL generation path is not CPU-ready. No public sample document is bundled; use only synthetic or public documents when sharing screenshots or issues.

API Usage

Start the local API:

export VISORAG_API_TOKEN="change-me"
python -m visorag serve --host 127.0.0.1 --port 8000

PowerShell:

$env:VISORAG_API_TOKEN = "change-me"
python -m visorag serve --host 127.0.0.1 --port 8000

The change-me value and the built-in test token fallback are local-development placeholders, not deployment values. Always set a strong VISORAG_API_TOKEN before serving real documents.
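One way to produce a strong value for VISORAG_API_TOKEN is Python's standard `secrets` module. This is a small sketch, not part of the project: the helper name `generate_api_token` is hypothetical, and 32 random bytes is an assumed, commonly used size.

```python
import secrets


def generate_api_token(num_bytes: int = 32) -> str:
    """Return a URL-safe random token drawn from the OS CSPRNG."""
    return secrets.token_urlsafe(num_bytes)


if __name__ == "__main__":
    # Print once, then export the value as VISORAG_API_TOKEN in your shell.
    print(generate_api_token())
```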

Health check:

python -m visorag health --url http://127.0.0.1:8000

Submit a document:

python -m visorag query \
  --url http://127.0.0.1:8000 \
  --token "$VISORAG_API_TOKEN" \
  --file /path/to/document.pdf \
  --query "Extract total due and invoice number" \
  --query-type extraction \
  --top-k 5

FastAPI routes:

  • GET / returns service metadata.
  • GET /health returns model-load flags and device info.
  • POST /query accepts multipart file plus query, query_type, document_type, and top_k.

See docs/API_REFERENCE.md for status codes, request fields, and both API-layer and pipeline error shapes.
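For clients that prefer Python over the CLI, the POST /query route can be called with HTTPX. The sketch below is illustrative only and rests on assumptions to check against docs/API_REFERENCE.md: the multipart part name `file`, the `Authorization: Bearer` header scheme, and the helper names `build_query_form`, `auth_headers`, and `post_query` are not confirmed by this README.

```python
from pathlib import Path
from typing import Dict, Optional


def build_query_form(
    query: str,
    query_type: str = "extraction",
    top_k: int = 5,
    document_type: Optional[str] = None,
) -> Dict[str, str]:
    """Build the non-file form fields accepted by POST /query."""
    form = {"query": query, "query_type": query_type, "top_k": str(top_k)}
    if document_type is not None:
        form["document_type"] = document_type
    return form


def auth_headers(token: str) -> Dict[str, str]:
    """Assumed bearer-token header scheme."""
    return {"Authorization": f"Bearer {token}"}


def post_query(base_url: str, token: str, file_path: str, **form_kwargs) -> dict:
    """Upload one document and return the decoded JSON body."""
    import httpx  # imported lazily so the helpers above work without it

    path = Path(file_path)
    with path.open("rb") as handle:
        response = httpx.post(
            f"{base_url.rstrip('/')}/query",
            headers=auth_headers(token),
            data=build_query_form(**form_kwargs),
            files={"file": (path.name, handle)},  # "file" part name is an assumption
            timeout=300.0,  # local GPU generation can be slow
        )
    return response.json()
```

A call would look like `post_query("http://127.0.0.1:8000", token, "/path/to/document.pdf", query="Extract total due and invoice number")`.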

Response Contract

Extraction mode returns flat JSON:

{
  "total_due": 6610.95,
  "invoice_number": "BPXIN-00550"
}

Factual and summary modes return:

{
  "answer": "The total due is 6610.95."
}

Pipeline errors include request_id. API validation and auth errors may include only error and detail, because the pipeline has not started at that point.
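A client can tell these response shapes apart with a small dispatcher. The following is a hedged sketch: `classify_response` is a hypothetical helper, and it assumes that pipeline errors carry an `error` key alongside `request_id`, which should be confirmed in docs/API_REFERENCE.md.

```python
from typing import Any, Dict


def classify_response(payload: Dict[str, Any]) -> str:
    """Label a decoded JSON body by the shape described in the response contract."""
    if "error" in payload:
        # Assumption: pipeline errors also include an "error" key.
        return "pipeline_error" if "request_id" in payload else "api_error"
    if "answer" in payload:
        return "answer"
    # Extraction mode is flat: no nested objects or arrays as values.
    if payload and all(not isinstance(v, (dict, list)) for v in payload.values()):
        return "extraction"
    return "unknown"
```

Treating nested values as `unknown` gives callers an early signal if a response drifts from the flat extraction contract.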

Tech Stack

| Layer | Technology |
| --- | --- |
| Language | Python 3.10+ |
| API | FastAPI, Uvicorn |
| CLI | argparse, HTTPX |
| Visual retrieval | ColQwen2 via colpali-engine |
| Vector store | Qdrant in-memory multivector collection |
| Answer generation | Local Qwen/Qwen2.5-VL-3B-Instruct |
| Document rendering | PyMuPDF, Pillow, LibreOffice for DOCX conversion |
| Tests and checks | Pytest, Ruff, notebook JSON/AST contract checks |

Architecture

flowchart LR
    A["Upload PDF, DOCX, or image"] --> B["Validate request"]
    B --> C["Render pages as images"]
    C --> D["Embed pages with ColQwen2"]
    D --> E["Index in per-request in-memory Qdrant"]
    F["User query"] --> G["Embed query with ColQwen2"]
    G --> H["Retrieve top visual pages"]
    E --> H
    H --> I["Generate answer with local Qwen2.5-VL"]
    I --> J["Flat JSON extraction or answer response"]

Runtime constraints:

  • Requests are serialized through a process-level query lock.
  • Uploads default to 25 MB.
  • Rendered documents default to 200 pages maximum.
  • Qdrant state is per request and in memory.
  • Uploaded files and rendered pages are temporary by default.

Longer architecture notes live in docs/ARCHITECTURE.md.

Documentation Map

| Need | Start here |
| --- | --- |
| First setup | docs/GETTING_STARTED.md |
| API clients | docs/API_REFERENCE.md |
| Architecture | docs/ARCHITECTURE.md |
| Deployment | docs/DEPLOYMENT.md |
| Troubleshooting | docs/TROUBLESHOOTING.md |
| Contribution model | CONTRIBUTING.md and docs/feature-extension.md |
| Naming and SEO rationale | docs/NAMING_SEO_STRATEGY.md |

Project Structure

.
|-- VisoRAG_v8_Final (1).ipynb   # authoritative notebook provenance
|-- src/visorag/                 # import-safe package extraction
|-- tests/                       # mocked unit/API/notebook contract tests
|-- scripts/                     # static notebook and import smoke checks
|-- docs/                        # architecture, API, deployment, and operations notes
|-- docs/assets/                 # README visuals and captured API screenshots
`-- .github/                     # CI, issue templates, PR template, Dependabot

Testing

python scripts/check_notebook_contract.py
python scripts/smoke_import.py
python -m ruff format --check .
python -m ruff check .
python -m pytest

Real model inference is intentionally outside the default CI path because it requires CUDA, model downloads, and enough GPU memory. CI proves package/import/API contracts, not production GPU throughput.

Deployment

VisoRAG is easiest to run as a long-running Python service on a GPU machine or notebook runtime. Do not expose the FastAPI server directly to the internet without a gateway, TLS, restricted CORS, a strong bearer token, request-log controls, and dependency/model-license review. See docs/DEPLOYMENT.md.

Security

Do not upload private documents to public issues, logs, or CI artifacts. The local API uses bearer-token auth but enables permissive CORS for notebook compatibility. The runtime may log filenames and query snippets, so disable or restrict logs before processing sensitive documents. See SECURITY.md and docs/security-model.md.

Support

Use SUPPORT.md for safe support requests and SECURITY.md for vulnerability reporting.

Contributing

Read CONTRIBUTING.md and docs/feature-extension.md before opening a pull request. Behavior changes must preserve the notebook contract or document the migration.

Citation

Use CITATION.cff for software citation metadata. VisoRAG builds on ColQwen2, Qwen2.5-VL, Qdrant, PyMuPDF, FastAPI, and the Python scientific/ML ecosystem; retain upstream license notices when redistributing.

See docs/THIRD_PARTY_NOTICES.md for dependency notice guidance.

Current Status

READY WITH GAPS: the source package, docs, CI checks, notebook contract, import smoke, mocked tests, local FastAPI health endpoint, API docs rendering, and approved GitHub rename to visualdocqa-kit have been verified. Real GPU inference with ColQwen2 and Qwen2.5-VL is documented but not verified in CI, and internet-facing deployment requires additional hardening.

License

MIT. See LICENSE.