Schema-first document intake service for turning PDF, DOCX, and image inputs into:
- validated typed records
- retrieval-ready chunks with provenance
- operator review tasks for ambiguous fields
- layout Markdown and embedding artifacts for downstream RAG pipelines
The repository ships with two execution modes:
- deterministic: in-memory, test-friendly, no network calls
- azure: live OCR/layout + LLM normalization + chunk embeddings
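Mode selection is presumably driven by the DOC_INTAKE_PROVIDER environment variable used in the setup steps below; a minimal sketch of how a service might pick a backend (the function and constant names here are illustrative, not the actual config.py API):

```python
import os
from typing import Mapping, Optional

# Illustrative; the two mode names come from this README, the rest is a sketch.
KNOWN_PROVIDERS = {"deterministic", "azure"}

def resolve_provider(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the configured provider, defaulting to the offline mode."""
    env = os.environ if env is None else env
    provider = env.get("DOC_INTAKE_PROVIDER", "deterministic").lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    return provider
```

Defaulting to the deterministic mode keeps tests and local development network-free unless Azure is explicitly opted into.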
[local file | URL]
|
v
[Azure Document Intelligence: prebuilt-layout]
|
v
[gpt-5-mini normalization by profile]
|
+--> [review queue for mid-confidence fields]
|
v
[chunk builder]
|
v
[text-embedding-3-small]
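The stages above can be sketched as a simple composition; the function body below is a stand-in (the real pipeline lives in src/multimodal_doc_intake_kit/pipeline.py), but the recorded trace mirrors the captured demo output:

```python
from typing import List, Tuple

# Illustrative stand-ins for the real pipeline stages; each stage appends the
# model/tool it used so the final trace matches the demo summary in this README.
def run_pipeline(source: str) -> Tuple[dict, List[str]]:
    trace: List[str] = []

    layout = {"markdown": f"# parsed {source}", "tables": []}
    trace.append("prebuilt-layout")           # OCR / layout extraction

    record = {"fields": {}, "layout": layout}
    trace.append("gpt-5-mini")                # schema normalization by profile

    chunks = [layout["markdown"]]             # chunk builder (trivial here)
    vectors = [[0.0] * 1536 for _ in chunks]  # one embedding per chunk
    trace.append("text-embedding-3-small")

    return {"record": record, "chunks": chunks, "vectors": vectors}, trace
```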
The current live configuration uses:
- prebuilt-layout for OCR, table extraction, and layout Markdown
- gpt-5-mini for field normalization
- gpt-5.4 as the intended adjudication tier for harder follow-up tasks
- text-embedding-3-small for retrieval vectors
| Profile | Key fields |
| --- | --- |
| contract_v1 | counterparty_name, effective_date, termination_notice_days |
| invoice_v1 | invoice_number, currency, total_due_cents, vendor_name |
| manual_v1 | document_title, revision, requires_review_cycle |
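The invoice_v1 fields, for instance, map naturally onto a typed record like the one below (a dataclass sketch; the repository's actual models live in src/multimodal_doc_intake_kit/models.py and may differ):

```python
from dataclasses import dataclass

@dataclass
class InvoiceV1:
    """Typed record for the invoice_v1 profile (field names from the table above)."""
    invoice_number: str
    currency: str
    total_due_cents: int
    vendor_name: str

    def validate(self) -> None:
        # Amounts are stored as integer cents, so negative totals indicate a
        # bad extraction that should be routed to review rather than exported.
        if self.total_due_cents < 0:
            raise ValueError("total_due_cents must be non-negative")
        if len(self.currency) != 3:
            raise ValueError("currency must be an ISO 4217 code like 'USD'")
```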
Generated demo documents live under demo/input:
The image input is intentionally noisy enough to trigger a realistic review path:
The repository includes captured outputs from a live Azure-backed run in demo/output.
Summary:
[
{
"document_id": "contract-live-2026-04-15",
"profile": "contract_v1",
"status": "exported",
"trace": ["prebuilt-layout", "gpt-5-mini", "text-embedding-3-small"]
},
{
"document_id": "invoice-live-2026-04-15",
"profile": "invoice_v1",
"status": "exported",
"trace": ["prebuilt-layout", "gpt-5-mini", "text-embedding-3-small"]
},
{
"document_id": "manual-live-2026-04-15",
"profile": "manual_v1",
"status": "exported",
"trace": ["prebuilt-layout", "gpt-5-mini", "text-embedding-3-small"]
}
]

Invoice extraction excerpt:
{
"invoice_number": "INV-2048-APR",
"currency": "USD",
"total_due_cents": 1284450,
"vendor_name": "Cascade Field Services"
}

Manual scan review path:
{
"revision": {
"initial_value": "17",
"review_action": "correct",
"corrected_value": "r7"
}
}

Contract chunk embedding artifact excerpt:
{
"chunk_id": "de55ec321afe71e63cc5e9a5a85ccef3e159b23c10293e1e999b39c1e8a252d6",
"dimensions": 1536,
"vector_preview": [-0.032795, 0.068723, 0.097257, 0.019036]
}

Endpoints:

- POST /api/ingest
- GET /api/documents/{document_id}
- GET /api/artifacts/{document_id}
- POST /api/review/{document_id}
- GET /api/exports/{document_id}
- GET /healthz
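The payload for POST /api/ingest can also be assembled programmatically; a sketch (paths and IDs are illustrative, and checksum_sha256 is just the SHA-256 hex digest of the file bytes):

```python
import hashlib
import json

def build_ingest_payload(document_id: str, path: str, data: bytes,
                         profile: str, mime_type: str) -> str:
    """Build the JSON body for POST /api/ingest (shape taken from this README)."""
    payload = {
        "document_id": document_id,
        "source": {
            "uri": path,
            "mime_type": mime_type,
            "checksum_sha256": hashlib.sha256(data).hexdigest(),
        },
        "profile": profile,
        "locale": "en-US",
        "metadata": {"tenant": "lab-demo"},
    }
    return json.dumps(payload, indent=2)
```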
Example ingest request:
curl -X POST http://127.0.0.1:8000/api/ingest \
-H "Content-Type: application/json" \
-d '{
"document_id": "invoice-live-2026-04-15",
"source": {
"uri": "/absolute/path/to/field-operations-invoice.pdf",
"mime_type": "application/pdf",
"checksum_sha256": "f349c0377448806316fed1124258ef77126a4f30506fa2f972fa1ba0644ee71e"
},
"profile": "invoice_v1",
"locale": "en-US",
"submitted_at": "2026-04-15T14:20:00Z",
"metadata": {
"tenant": "lab-demo"
}
}'

Prerequisites:

- Python 3.9+
- uv
Install:
uv sync --extra dev --extra demo

Deterministic mode:
uv run uvicorn multimodal_doc_intake_kit.main:app --app-dir src --reload

Live Azure mode:
export DOC_INTAKE_PROVIDER=azure
export AZURE_DOCINTELLIGENCE_ENDPOINT="https://<resource>.cognitiveservices.azure.com/"
export AZURE_DOCINTELLIGENCE_API_KEY="<key>"
export AZURE_OPENAI_ENDPOINT="https://<resource>.openai.azure.com/"
export AZURE_OPENAI_API_KEY="<key>"
export AZURE_OPENAI_API_VERSION="2025-04-01-preview"
export AZURE_OPENAI_CHAT_DEPLOYMENT="gpt-5-mini"
export AZURE_OPENAI_REASONING_DEPLOYMENT="gpt-5.4"
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT="text-embedding-3-small"
uv run uvicorn multimodal_doc_intake_kit.main:app --app-dir src --reload

Generate the sample files:
uv run python scripts/generate_demo_inputs.py

Run the live end-to-end pipeline and write captured artifacts:
uv run python scripts/run_live_demo.py

Artifacts written:
- demo/output/demo-summary.json
- demo/output/contract-live-2026-04-15.export.json
- demo/output/invoice-live-2026-04-15.export.json
- demo/output/manual-live-2026-04-15.review.json
- demo/output/manual-live-2026-04-15.export.json
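A quick sanity check over demo-summary.json (a sketch; assumes the captured structure shown earlier in this README):

```python
import json

def count_exported(summary_json: str) -> int:
    """Count documents that reached the exported status in a demo summary."""
    entries = json.loads(summary_json)
    return sum(1 for entry in entries if entry.get("status") == "exported")
```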
Deployment and provider instructions live in docs/azure-foundry.md.
Short version:
- keep OCR/layout on Azure Document Intelligence
- keep schema normalization on a current chat/reasoning deployment (gpt-5-mini here)
- keep embeddings on text-embedding-3-small
- add a second reasoning tier only if you need explicit adjudication or post-review synthesis
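Retrieval over the exported chunk vectors reduces to nearest-neighbor search; a minimal cosine-similarity sketch over embedding vectors (no vector store assumed, chunk IDs are illustrative):

```python
import math
from typing import List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query: List[float], chunks: List[Tuple[str, List[float]]],
          k: int = 3) -> List[str]:
    """Return the chunk_ids of the k chunks most similar to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query, c[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```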
The Azure implementation is the only provider wired in code today. Equivalent stacks for the same design:
- Google Cloud
- OCR/layout: Document AI layout parser or Enterprise Document OCR
- normalization: gemini-2.5-pro
- embeddings: gemini-embedding-001
- OpenAI-compatible stack
- OCR/layout: external OCR service of choice
- normalization: current structured-output capable model
- embeddings: a small retrieval embedding model
.
├── demo/
│ ├── input/
│ └── output/
├── docs/
│ ├── architecture.md
│ ├── azure-foundry.md
│ ├── implementation-plan.md
│ └── pipeline-notes.md
├── schemas/
├── scripts/
│ ├── generate_demo_inputs.py
│ └── run_live_demo.py
├── src/multimodal_doc_intake_kit/
│ ├── azure_backend.py
│ ├── config.py
│ ├── main.py
│ ├── models.py
│ ├── pipeline.py
│ ├── service.py
│ └── store.py
└── tests/
Run the tests:

uv run pytest -q