Welcome to PMOVES-DoX - a powerful document intelligence platform for extracting, analyzing, and structuring data from multiple document formats.
- What is PMOVES-DoX?
- Quick Start
- Core Concepts
- Features Overview
- Workflows
- Best Practices
- Troubleshooting
PMOVES-DoX is an enterprise-grade document processing platform that:
- Extracts structured data from PDFs, spreadsheets, logs, and APIs
- Analyzes documents using AI/ML models (NER, financial detection, metrics)
- Structures data with constellation harvest regularization (CHR)
- Visualizes insights with interactive dashboards
- Integrates with Microsoft Copilot via POML export
✨ Multi-Format Ingestion
- PDFs (with table extraction, OCR, VLM descriptions)
- CSV/XLSX spreadsheets
- XML logs with custom XPath mapping
- OpenAPI/Postman API specifications
- Web pages (with headless rendering)
- Audio/video transcription
- Image OCR
🔍 Advanced Analysis
- Named Entity Recognition (NER)
- Financial statement detection
- Business metric extraction
- Document structure analysis
- Tag extraction with LangExtract
🎯 Smart Search
- Vector-based semantic search (FAISS)
- Type filtering (PDF, API, LOG, TAG)
- Citation-based Q&A with source tracking
📊 Data Structuring
- Constellation Harvest Regularization (CHR)
- PCA visualization
- datavzrd dashboard generation
# Clone the repository
git clone https://github.com/POWERFULMOVES/PMOVES-DoX.git
cd PMOVES-DoX
# Start with Docker Compose
docker compose up --build
# Access the application
# Frontend: http://localhost:3737
# Backend API: http://localhost:8484
Backend:
cd backend
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt
# Create .env file
cp .env.example .env
# Edit .env with your settings
# Run backend
python -m uvicorn app.main:app --host 0.0.0.0 --port 8484
Frontend:
cd frontend
npm install
# Create .env.local
cp .env.local.example .env.local
# Run frontend
npm run dev
- Upload a Document
  - Click "Choose Files" or drag & drop
  - Supported: PDF, CSV, XLSX, XML, JSON
  - Wait for processing to complete
- Explore Facts
  - View extracted facts in the Facts panel
  - Click page numbers to see evidence
  - Filter by artifact or search
- Run Semantic Search
  - Use the global search bar
  - Try: "What is the total revenue?"
  - Filter by type (PDF, API, LOG, TAG)
- Ask Questions
  - Navigate to Q&A tab
  - Enter natural language questions
  - Get answers with citations
What: An artifact is any ingested document or data source.
Types:
- .pdf - PDF documents
- .csv, .xlsx - Spreadsheets
- .xml - Log files
- .json - OpenAPI/Postman specs
- url - Web pages
Attributes:
- id - Unique identifier
- filename - Original filename
- filepath - Storage location
- filetype - File extension
- report_week - Optional temporal grouping
- status - Processing status
What: Facts are atomic pieces of information extracted from artifacts.
Structure:
{
"id": "uuid",
"artifact_id": "uuid",
"page_number": 1,
"content": "Revenue: $1.2M",
"confidence": 0.95,
"report_week": "2024-W01"
}
What: Evidence is the detailed context supporting facts (tables, charts, formulas).
Types:
- table - Tabular data with headers
- chart - Visual charts/figures
- formula - Mathematical expressions
- web_page - Web content
- audio_transcript - Speech-to-text
- image_ocr - OCR results
Structure:
{
"id": "uuid",
"artifact_id": "uuid",
"content_type": "table",
"locator": "Page 5, Table 2",
"preview": "Revenue breakdown...",
"full_data": { /* complete data */ }
}
What: Tags are semantic labels extracted using LangExtract/LLMs.
Use Cases:
- Learning Management System (LMS) tag extraction
- Application governance tagging
- Custom taxonomy mapping
PMOVES-DoX uses IBM Docling for advanced PDF processing:
Capabilities:
- Multi-page table detection and merging
- Chart/figure extraction with OCR
- Formula detection (inline + block equations)
- VLM captions for images (optional)
- Layout analysis with heading hierarchy
Configuration:
# Enable VLM picture descriptions
export DOCLING_VLM_REPO=ibm-granite/granite-docling-258M
# GPU acceleration
export DOCLING_DEVICE=cuda
# Financial statement detection
export PDF_FINANCIAL_ANALYSIS=true
Example:
# Upload a financial report
curl -X POST http://localhost:8484/upload \
-F "files=@financial_report_Q4.pdf" \
-F "report_week=2024-W52"
Output:
- Extracted tables with merged headers
- Detected financial statements (income, balance sheet, cash flow)
- Chart images saved to artifacts/charts/
- Formulas as evidence entries
Smart Processing:
- Automatic header detection
- Metric extraction (revenue, clicks, impressions)
- CTR calculation for ad performance
- Row-level fact creation
Example CSV:
date,revenue,clicks,impressions
2024-01-01,1500,245,12000
2024-01-02,1820,298,14500
API Call:
curl -X POST http://localhost:8484/upload \
-F "files=@performance_data.csv" \
-F "report_week=2024-W01"
Extracted Facts:
- Each row becomes a fact
- Metrics: revenue=1500, clicks=245, ctr=2.04%
- Searchable and queryable
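The CTR figure above follows from clicks divided by impressions. The sketch below reproduces that arithmetic for the example CSV; `row_to_fact` and the `ctr_percent` key are hypothetical names for illustration, not the backend's actual implementation or fact schema.

```python
import csv
import io

# The example CSV from this section.
SAMPLE_CSV = """date,revenue,clicks,impressions
2024-01-01,1500,245,12000
2024-01-02,1820,298,14500
"""


def row_to_fact(row: dict) -> dict:
    """Turn one CSV row into a fact-like dict (hypothetical shape)."""
    clicks = int(row["clicks"])
    impressions = int(row["impressions"])
    # CTR as a percentage; None when there are no impressions to divide by.
    ctr = round(100 * clicks / impressions, 2) if impressions else None
    return {
        "date": row["date"],
        "revenue": float(row["revenue"]),
        "clicks": clicks,
        "ctr_percent": ctr,
    }


facts = [row_to_fact(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for fact in facts:
    print(fact)
```

For the first row this gives 245 / 12000 = 2.04%, matching the metric listed above.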
Custom XPath Mapping:
Define how to parse XML logs with custom field mappings:
# Inline mapping
export XML_XPATH_MAP='{"entry":"//log","fields":{"ts":"./timestamp","level":"./severity","code":"./errorCode","component":"./service","message":"./msg"}}'
# Or file-based
export XML_XPATH_MAP_FILE=/app/config/xpath_map.yaml
Example XML:
<logs>
<log>
<timestamp>2024-01-01T10:00:00Z</timestamp>
<severity>ERROR</severity>
<errorCode>E1001</errorCode>
<service>payment-service</service>
<msg>Payment gateway timeout</msg>
</log>
</logs>
Upload:
curl -X POST http://localhost:8484/ingest/xml \
-H "Content-Type: application/json" \
-d '{"filepath":"logs/app.xml"}'
Features:
- Time range filtering
- Severity level filtering
- Error code search
- CSV export
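To make the XPath mapping concrete, here is a minimal sketch that applies the inline mapping above to the example XML using Python's standard library. It assumes `entry` selects one element per log record and each `fields` path is evaluated relative to that element; `parse_logs` is a hypothetical helper, and the backend's own parser may behave differently.

```python
import json
import xml.etree.ElementTree as ET

# The inline mapping from the XML_XPATH_MAP example.
XPATH_MAP = json.loads(
    '{"entry":"//log","fields":{"ts":"./timestamp","level":"./severity",'
    '"code":"./errorCode","component":"./service","message":"./msg"}}'
)

SAMPLE_XML = """<logs>
  <log>
    <timestamp>2024-01-01T10:00:00Z</timestamp>
    <severity>ERROR</severity>
    <errorCode>E1001</errorCode>
    <service>payment-service</service>
    <msg>Payment gateway timeout</msg>
  </log>
</logs>"""


def parse_logs(xml_text: str, mapping: dict) -> list:
    """Return one dict per log entry, keyed by the mapped field names."""
    root = ET.fromstring(xml_text)
    entry_path = mapping["entry"]
    # ElementTree only accepts relative paths on an element,
    # so "//log" is rewritten to ".//log".
    if entry_path.startswith("//"):
        entry_path = "." + entry_path
    rows = []
    for entry in root.findall(entry_path):
        rows.append(
            {name: entry.findtext(path) for name, path in mapping["fields"].items()}
        )
    return rows


rows = parse_logs(SAMPLE_XML, XPATH_MAP)
print(rows)
```

Fields whose path matches nothing come back as `None`, which makes mapping mistakes easy to spot before ingesting a large log file.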
Ingest API Specifications:
# OpenAPI
curl -X POST http://localhost:8484/ingest/openapi \
-H "Content-Type: application/json" \
-d '{"filepath":"specs/petstore.json"}'
# Postman Collection
curl -X POST http://localhost:8484/ingest/postman \
-H "Content-Type: application/json" \
-d '{"collection_path":"collections/api-tests.json"}'
Catalog View:
- Browse all API endpoints
- See request/response schemas
- Filter by method (GET, POST, PUT, DELETE)
- Export to documentation
Vector Search with FAISS:
# Search all content
curl -X POST http://localhost:8484/search \
-H "Content-Type: application/json" \
-d '{"q":"total revenue in Q4","k":10}'
# Filter by type
curl -X POST http://localhost:8484/search \
-H "Content-Type: application/json" \
-d '{"q":"error handling","types":["api"],"k":5}'
Type Filters:
- pdf - PDF documents only
- api - API endpoints only
- log - Log entries only
- tag - LMS tags only
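When calling search from a script, it can help to validate the request body before sending it. The sketch below builds the same JSON bodies as the curl examples above; `build_search_payload` is a hypothetical client-side helper, and the allowed type names are taken from the filter list in this section.

```python
import json

# Type filters listed in this guide.
ALLOWED_TYPES = {"pdf", "api", "log", "tag"}


def build_search_payload(query: str, k: int = 10, types=None) -> dict:
    """Build a /search request body, rejecting obviously invalid input."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if k < 1:
        raise ValueError("k must be at least 1")
    payload = {"q": query, "k": k}
    if types:
        unknown = set(types) - ALLOWED_TYPES
        if unknown:
            raise ValueError(f"unknown type filter(s): {sorted(unknown)}")
        payload["types"] = list(types)
    return payload


body = json.dumps(build_search_payload("error handling", k=5, types=["api"]))
print(body)
```

The printed string can be passed to curl with `-d`, or sent with any HTTP client.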
Response:
{
"results": [
{
"content": "Q4 Revenue: $1.2M",
"score": 0.89,
"artifact_id": "uuid",
"page": 12,
"type": "pdf"
}
]
}
Ask Natural Language Questions:
curl -X POST "http://localhost:8484/ask?question=What%20is%20the%20total%20revenue?"
Features:
- Citation tracking (shows source page/table)
- Confidence scoring
- Multiple answer candidates
- Optional HRM refinement
UI Workflow:
- Navigate to Q&A tab
- Type question: "What was the highest expense category?"
- Get answer with citations
- Click citation to see source evidence
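The `/ask` endpoint takes the question as a URL query parameter, so spaces and punctuation must be percent-encoded. A minimal sketch of building that URL with the standard library follows; `ask_url` is a hypothetical helper, and the base URL assumes the default local backend port used in this guide.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8484"  # default backend address in this guide


def ask_url(question: str, base_url: str = BASE_URL) -> str:
    """Return the /ask URL with the question percent-encoded."""
    # quote_via=quote encodes spaces as %20 rather than "+".
    query = urlencode({"question": question}, quote_via=quote)
    return f"{base_url}/ask?{query}"


url = ask_url("What was the highest expense category?")
print(url)
```

The resulting URL can be used directly in a `curl -X POST` call.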
Extract Semantic Tags:
Providers:
- Google Gemini (default): gemini-2.5-flash
- Ollama (local): ollama:gemma3
Configuration:
# Use Gemini
export LANGEXTRACT_API_KEY=your-gemini-key
export LANGEXTRACT_MODEL=gemini-2.5-flash
# Or use Ollama
export LANGEXTRACT_PROVIDER=ollama
export LANGEXTRACT_MODEL=ollama:gemma3
export OLLAMA_BASE_URL=http://ollama:11434
API Call:
curl -X POST http://localhost:8484/extract/tags \
-H "Content-Type: application/json" \
-d '{
"document_id":"uuid",
"preset":"lms_comprehensive",
"dry_run":false,
"use_hrm":true
}'
Presets:
- lms_comprehensive - Full LMS taxonomy
- lms_skills - Skills/competencies only
- lms_governance - Compliance/governance tags
- custom - Provide your own prompt
HRM (Hierarchical Refinement Module):
- Iterative tag deduplication
- Confidence-based filtering
- Early halting when stable
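The three HRM behaviors listed above can be pictured as a small refinement loop. The sketch below is purely illustrative: `refine_tags`, its parameters, and the (label, confidence) tuple shape are assumptions made for this example, and the actual HRM implementation in PMOVES-DoX may work quite differently.

```python
def refine_tags(tags, min_confidence=0.5, max_rounds=5):
    """Deduplicate and filter (label, confidence) pairs until the list is stable.

    Hypothetical illustration only; not the project's HRM code.
    Returns the refined tags and the number of rounds that were run.
    """
    current = list(tags)
    for round_no in range(1, max_rounds + 1):
        best = {}
        for label, confidence in current:
            # Normalize case and whitespace so near-duplicates collapse.
            key = " ".join(label.lower().split())
            # Confidence-based filtering, keeping the strongest duplicate.
            if confidence >= min_confidence and confidence > best.get(key, (None, -1.0))[1]:
                best[key] = (key, confidence)
        refined = sorted(best.values())
        # Early halting: stop as soon as a round changes nothing.
        if refined == current:
            return refined, round_no
        current = refined
    return current, max_rounds


raw_tags = [("Python", 0.9), ("python ", 0.7), ("Compliance", 0.4), ("Data Privacy", 0.8)]
refined, rounds = refine_tags(raw_tags)
print(refined, rounds)
```

Here the duplicate "python " entry is merged, the low-confidence "Compliance" tag is dropped, and the loop halts once a round leaves the list unchanged.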
Constellation Harvest Regularization:
Transform unstructured text into structured clusters:
curl -X POST http://localhost:8484/structure/chr \
-H "Content-Type: application/json" \
-d '{
"artifact_id":"uuid",
"K":6,
"units_mode":"sentences",
"cluster_params":{"min_samples":2}
}'
Parameters:
- K - Number of clusters
- units_mode - sentences or paragraphs
- cluster_params - HDBSCAN parameters
Output:
- chr_clusters.csv - Clustered data
- chr_relation_strength.csv - Similarity matrix
- chr_pca.png - PCA visualization
- datavzrd.yaml - Dashboard config
View Results:
# Generate datavzrd dashboard
curl -X POST http://localhost:8484/viz/datavzrd \
-H "Content-Type: application/json" \
-d '{"csv_path":"artifacts/chr_clusters.csv"}'
# Start datavzrd (with tools profile)
docker compose --profile tools up datavzrd
# Access: http://localhost:5173
Generate Multi-Document Summaries:
Styles:
- bullet - Bullet point summary
- executive - Executive summary paragraph
- action_items - Action items list
Scopes:
- workspace - All artifacts
- artifact - Specific artifact(s)
# Workspace summary
curl -X POST http://localhost:8484/summaries/generate \
-H "Content-Type: application/json" \
-d '{"style":"executive","scope":"workspace"}'
# Artifact-specific summary
curl -X POST http://localhost:8484/summaries/generate \
-H "Content-Type: application/json" \
-d '{
"style":"bullet",
"scope":"artifact",
"artifact_ids":["uuid1","uuid2"]
}'
View History:
curl "http://localhost:8484/summaries?scope=workspace&style=executive"
Generate POML for Copilot Studio:
curl -X POST http://localhost:8484/export/poml \
-H "Content-Type: application/json" \
-d '{
"document_id":"uuid",
"title":"Q4 Financial Analysis",
"variant":"catalog"
}'
Variants:
- generic - General knowledge base
- troubleshoot - Troubleshooting guide
- catalog - API/service catalog
Output:
<?xml version="1.0" encoding="UTF-8"?>
<promptOML xmlns="http://schemas.microsoft.com/copilot/promptOML/1.0">
<title>Q4 Financial Analysis</title>
<content>
<section name="Overview">
<text>Financial data from Q4 2024...</text>
</section>
<section name="APIs">
<text>GET /api/revenue - Returns revenue data</text>
</section>
</content>
</promptOML>
Goal: Extract and analyze financial statements from PDF reports.
Steps:
1. Upload PDF:
   curl -X POST http://localhost:8484/upload \
     -F "files=@annual_report_2024.pdf" \
     -F "report_week=2024-W52"
2. Check Financial Detection:
   curl http://localhost:8484/analysis/financials
   Response:
   {
     "statements": [
       {
         "type": "income_statement",
         "page": 15,
         "confidence": 0.92,
         "table_id": "uuid"
       }
     ]
   }
3. View Extracted Tables:
   curl http://localhost:8484/artifacts/{artifact_id}
4. Ask Questions:
   curl -X POST "http://localhost:8484/ask?question=What%20was%20net%20income%20in%202024?"
5. Export to POML:
   curl -X POST http://localhost:8484/export/poml \
     -H "Content-Type: application/json" \
     -d '{"document_id":"uuid","variant":"catalog"}'
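Before relying on a detected statement, it can be useful to filter the `/analysis/financials` response by confidence. The sketch below works on the example response shown in step 2; the 0.8 threshold and the `confident_statements` helper are illustrative assumptions rather than project defaults.

```python
import json

# Example response body from the financial detection step.
RESPONSE = """
{
  "statements": [
    {"type": "income_statement", "page": 15, "confidence": 0.92, "table_id": "uuid"}
  ]
}
"""


def confident_statements(payload: dict, threshold: float = 0.8) -> list:
    """Keep only detected statements at or above the confidence threshold."""
    return [
        statement
        for statement in payload.get("statements", [])
        if statement.get("confidence", 0.0) >= threshold
    ]


hits = confident_statements(json.loads(RESPONSE))
for statement in hits:
    print(f'{statement["type"]} on page {statement["page"]} '
          f'(confidence {statement["confidence"]:.2f})')
```

Raising the threshold trades recall for precision; statements that fall below it are candidates for manual review.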
Goal: Analyze application logs for errors and patterns.
Steps:
1. Configure XPath Mapping:
   export XML_XPATH_MAP='{"entry":"//log","fields":{"ts":"./time","level":"./level","code":"./code","message":"./msg"}}'
2. Ingest Logs:
   curl -X POST http://localhost:8484/ingest/xml \
     -H "Content-Type: application/json" \
     -d '{"filepath":"logs/app-2024-01.xml"}'
3. Filter Errors:
   curl "http://localhost:8484/logs?level=ERROR&from=2024-01-01T00:00:00Z&to=2024-01-31T23:59:59Z"
4. Export CSV:
   curl "http://localhost:8484/logs/export?level=ERROR" > errors.csv
5. Generate Dashboard:
   curl -X POST http://localhost:8484/viz/datavzrd/logs \
     -H "Content-Type: application/json" \
     -d '{"level":"ERROR","from":"2024-01-01T00:00:00Z"}'
Goal: Create searchable API catalog from OpenAPI specs.
Steps:
1. Ingest OpenAPI Spec:
   curl -X POST http://localhost:8484/ingest/openapi \
     -H "Content-Type: application/json" \
     -d '{"filepath":"specs/petstore-v3.json"}'
2. Browse Endpoints:
   curl http://localhost:8484/apis
3. Search Specific Operation:
   curl -X POST http://localhost:8484/search \
     -H "Content-Type: application/json" \
     -d '{"q":"create user","types":["api"]}'
4. View Endpoint Details:
   curl http://localhost:8484/apis/{api_id}
5. Export Documentation:
   curl -X POST http://localhost:8484/export/poml \
     -H "Content-Type: application/json" \
     -d '{"document_id":"uuid","variant":"catalog","title":"API Reference"}'
Goal: Organize unstructured documents into thematic clusters.
Steps:
1. Upload Multiple PDFs:
   for file in docs/*.pdf; do
     curl -X POST http://localhost:8484/upload -F "files=@$file"
   done
2. Rebuild Search Index:
   curl -X POST http://localhost:8484/search/rebuild
3. Run CHR:
   curl -X POST http://localhost:8484/structure/chr \
     -H "Content-Type: application/json" \
     -d '{"artifact_id":"uuid","K":8,"units_mode":"paragraphs"}'
4. Visualize Clusters:
   - Open artifacts/chr_pca.png
   - Review artifacts/chr_clusters.csv
5. Generate Dashboard:
   docker compose --profile tools up datavzrd
   # Visit http://localhost:5173
- Use GPU Acceleration:
  export DOCLING_DEVICE=cuda
  export SEARCH_DEVICE=cuda
- Async PDF Processing:
  curl -X POST "http://localhost:8484/upload?async_pdf=true" -F "files=@large.pdf"
- Batch Uploads:
  curl -X POST http://localhost:8484/upload \
    -F "files=@doc1.pdf" \
    -F "files=@doc2.pdf" \
    -F "files=@doc3.pdf"
- Use Report Weeks:
  # Group by time period
  curl -X POST http://localhost:8484/upload \
    -F "files=@q1_report.pdf" \
    -F "report_week=2024-W13"
- Tag Early:
  - Extract tags immediately after upload
  - Use dry-run mode to preview
  - Save custom prompts for reuse
- Regular Index Rebuilds:
  # After bulk uploads
  curl -X POST http://localhost:8484/search/rebuild
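The `report_week` values in this guide (for example 2024-W13) look like ISO 8601 week labels. Assuming that is the intended format, a label can be derived from a calendar date as sketched below; `report_week` here is a hypothetical helper function, not part of the PMOVES-DoX API.

```python
from datetime import date


def report_week(day: date) -> str:
    """Format a date as an ISO year-week label such as 2024-W13."""
    # isocalendar() yields (ISO year, ISO week number, ISO weekday).
    iso_year, iso_week, _ = day.isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


print(report_week(date(2024, 3, 31)))
```

Note that the ISO year can differ from the calendar year for dates in the first or last days of a year, so deriving the label from the date is safer than concatenating the year by hand.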
- File Size Limits:
  - Default: 100MB per file
  - Adjust via MAX_FILE_SIZE in code
- SSRF Protection:
  - Web ingestion blocks private IPs
  - Only http/https/data URLs allowed
- API Authentication:
  - Currently no auth (local use)
  - Add reverse proxy for production (nginx, Caddy)
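To illustrate the kind of check the SSRF rules above describe, here is a simplified sketch of a URL allow-list. It is not the project's implementation: `is_url_allowed` is a hypothetical function, and it only inspects literal IP addresses and the name `localhost`, whereas a production check would also resolve hostnames and validate every resolved address.

```python
import ipaddress
from urllib.parse import urlsplit

# Schemes named in the SSRF rule above.
ALLOWED_SCHEMES = {"http", "https", "data"}


def is_url_allowed(url: str) -> bool:
    """Return True if the URL passes a basic scheme and private-IP check."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    if parts.scheme == "data":
        # data: URLs carry their content inline and have no network host.
        return True
    host = parts.hostname
    if not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal. Simplification: only reject the obvious
        # local name; a real check should resolve the hostname first.
        return host.lower() != "localhost"
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)


for candidate in ("https://example.com/page", "http://10.0.0.5/admin", "ftp://example.com"):
    print(candidate, is_url_allowed(candidate))
```

Checking the address after DNS resolution matters because a public-looking hostname can still point at an internal IP.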
- Check Health:
  curl http://localhost:8484/health
- View Tasks:
  curl http://localhost:8484/tasks
- Monitor Logs:
  docker compose logs -f backend
1. "Port already in use"
# Check what's using the port
lsof -i :8484
# Change port
export PORT=8585
2. "CUDA out of memory"
# Use CPU instead
export DOCLING_DEVICE=cpu
export SEARCH_DEVICE=cpu
# Or reduce the number of Docling threads
export DOCLING_NUM_THREADS=2
3. "Module not found: docling"
# Reinstall dependencies
pip install -r backend/requirements.txt --force-reinstall
4. "VLM model not found"
# Download model
export HUGGINGFACE_HUB_TOKEN=your-token
export DOCLING_VLM_REPO=ibm-granite/granite-docling-258M
# Or disable VLM
unset DOCLING_VLM_REPO
5. "Search returns no results"
# Rebuild index
curl -X POST http://localhost:8484/search/rebuild
# Check if artifacts exist
curl http://localhost:8484/artifacts
6. "LangExtract timeout"
# Increase timeout or switch to local Ollama
export LANGEXTRACT_PROVIDER=ollama
export LANGEXTRACT_MODEL=ollama:gemma3
export OLLAMA_BASE_URL=http://localhost:11434
# Enable verbose logging
export LOG_LEVEL=DEBUG
# Run backend with reload
uvicorn app.main:app --reload --host 0.0.0.0 --port 8484
# Clear all data
curl -X DELETE http://localhost:8484/reset
# Or manually
rm backend/database.db
rm -rf backend/uploads/* backend/artifacts/*
- 📖 Check out COOKBOOKS.md for detailed tutorials
- 🎨 See DEMOS.md for interactive examples
- 🔧 Read API_REFERENCE.md for complete API docs
- 🏗️ Review ARCHITECTURE.md for system design
Questions or Issues?
- GitHub Issues: https://github.com/POWERFULMOVES/PMOVES-DoX/issues
- Documentation: https://github.com/POWERFULMOVES/PMOVES-DoX/tree/main/docs