Commit e4eaec7

feat(docs): add multi-document collection discovery guide
1 parent 0a67db1 commit e4eaec7

1 file changed

Lines changed: 148 additions & 8 deletions

docs/discovery.md

@@ -28,6 +28,7 @@ https://github.com/user-attachments/assets/9c3923fb-f4ff-43cd-a563-44c7c6132921
2828
- [Discovery Without Ground Truth](#discovery-without-ground-truth)
2929
- [Discovery With Ground Truth](#discovery-with-ground-truth)
3030
- [Multi-Section Package Discovery](#multi-section-package-discovery)
31+
- [Multi-Document Collection Discovery](#multi-document-collection-discovery)
3132
- [Choosing the Right Method](#choosing-the-right-method)
3233
- [Configuration](#configuration)
3334
- [Model Configuration](#model-configuration)
@@ -84,6 +85,7 @@ This analysis produces structured configuration templates that can be used to co
8485
- **⚡ Real-Time Processing**: Provides immediate feedback through the web interface
8586
- **📊 PDF Page Thumbnails**: Visual page preview with color-coded range highlighting in the browser
8687
- **🔗 BDA Blueprint Automation**: Automatic BDA blueprint creation when running in BDA mode
88+
- **📦 Multi-Document Collection Discovery**: Discover document classes from a collection of documents using embedding-based clustering (local or S3)
8789

8890
### Use Cases
8991

@@ -378,16 +380,154 @@ When a `page_range` is specified for a PDF, the system uses `pypdfium2` to extra
378380
- Each page range job is independent and can run in parallel
379381
- The original document is never modified
380382

383+
### Multi-Document Collection Discovery
384+
385+
While the other discovery methods analyze a single document at a time, **Multi-Document Collection Discovery** discovers document classes from a *collection* of documents using embedding-based clustering. Given a folder of mixed documents (e.g., invoices, W-2s, bank statements), it automatically groups similar documents together and generates a JSON Schema and classification for each group.
386+
387+
> **Requires extra dependencies:** `pip install "idp_common[multi_document_discovery]"` or `make setup` from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.
388+
389+
> **Minimum 2 documents per class:** Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.
390+
391+
#### How It Works
392+
393+
```mermaid
graph LR
    A[Document Collection] --> B[Embed]
    B --> C[Cluster]
    C --> D[Analyze]
    D --> E[Reflect]
    E --> F[Discovered Classes + Schemas]
```
401+
402+
1. **Embed** — Each document's first page is rendered to an image and embedded as a vector using Amazon Bedrock (`us.cohere.embed-v4:0` by default)
403+
2. **Cluster** — Embeddings are clustered using KMeans with automatic K selection via silhouette analysis (scikit-learn). Clusters with fewer than `min_cluster_size` (default: 2) documents are filtered as noise.
404+
3. **Analyze** — For each cluster, a Strands agent with Claude (`us.anthropic.claude-sonnet-4-6`) examines sample document images and generates a classification name + JSON Schema definition
405+
4. **Reflect** — The agent produces a Markdown reflection report reviewing all discovered classes, their relationships, and potential overlaps
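
The clustering step can be illustrated in isolation. The following is a minimal pure-Python sketch, not the pipeline's implementation (which the text above says uses scikit-learn). The function names, the 2-D points standing in for embeddings, and the hand-written candidate labelings standing in for KMeans output at each K are all assumptions made for illustration. It scores each candidate K with the mean silhouette coefficient, keeps the best one, and then drops clusters smaller than `min_cluster_size`.

```python
import math
from collections import defaultdict


def group_by_label(items, labels):
    """Map each cluster label to the list of items assigned to it."""
    groups = defaultdict(list)
    for item, label in zip(items, labels):
        groups[label].append(item)
    return dict(groups)


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (higher is better)."""
    clusters = group_by_label(points, labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


def filter_small_clusters(doc_ids, labels, min_cluster_size=2):
    """Split clusters into kept clusters and documents treated as noise."""
    clusters = group_by_label(doc_ids, labels)
    kept = {k: docs for k, docs in clusters.items() if len(docs) >= min_cluster_size}
    noise = [d for docs in clusters.values() if len(docs) < min_cluster_size for d in docs]
    return kept, noise


# Hypothetical inputs: toy 2-D "embeddings" and one labeling per candidate K.
doc_ids = ["invoice1.pdf", "invoice2.pdf", "w2_form.pdf", "w2_form2.pdf", "memo.pdf"]
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (50.0, 0.0)]
candidates = {2: [0, 0, 1, 1, 1], 3: [0, 0, 1, 1, 2]}

best_k = max(candidates, key=lambda k: silhouette_score(points, candidates[k]))
kept, noise = filter_small_clusters(doc_ids, candidates[best_k])
print(best_k, kept, noise)
```

In this toy run the single memo ends up in its own cluster and is filtered out, which is why the guide asks for at least two documents per expected class.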
406+
407+
#### Supported File Types
408+
409+
`.pdf`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, `.webp`
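
To check in advance which files in a folder would qualify, a small helper along these lines may be useful. It is a sketch under stated assumptions: `find_supported_documents` is a hypothetical name, and the recursive search and case-insensitive extension matching are guesses rather than documented behavior of the pipeline.

```python
import tempfile
from pathlib import Path

# Extensions listed in the guide above.
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".webp"}


def find_supported_documents(document_dir):
    """Return sorted file paths under document_dir with a supported extension."""
    root = Path(document_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_EXTENSIONS
    )


# Demonstration on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("invoice1.pdf", "scan.TIF", "notes.txt"):
        (Path(tmp) / name).touch()
    found_names = [path.name for path in find_supported_documents(tmp)]

print(found_names)
```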
410+
411+
#### Two Execution Modes
412+
413+
| Mode | Documents Source | Use Case | Entry Point |
|------|-----------------|----------|-------------|
| **Local** | Local filesystem | CLI/SDK development, no AWS infra needed | `run_local_pipeline()` |
| **S3** | Amazon S3 bucket | Lambda/Step Functions, production workloads | `run_full_pipeline()` |
417+
418+
Both modes return a `MultiDocDiscoveryResult` with the same fields and run the same pipeline steps.
419+
420+
#### Usage — IDP CLI
421+
422+
The simplest way to run multi-document discovery:
423+
424+
```bash
# Discover classes from a directory of documents
idp-cli discover-multidoc --dir /path/to/documents/

# With explicit files
idp-cli discover-multidoc -d invoice1.pdf -d invoice2.pdf -d w2_form.pdf -d w2_form2.pdf

# Save results to a configuration version
idp-cli discover-multidoc --dir /path/to/documents/ --save-to-config --config-version v1
```
434+
435+
See [IDP CLI Reference — `discover-multidoc`](idp-cli.md) for all options.
436+
437+
#### Usage — IDP SDK
438+
439+
```python
from idp_sdk import IDPClient

client = IDPClient()
result = client.discovery.run_multi_doc(
    document_dir="/path/to/documents/",
    progress_callback=lambda step, data: print(f"{step}: {data}"),
)

print(f"Status: {result.status}")
print(f"Found {result.total_clusters} clusters from {result.total_documents} documents")

for cls in result.discovered_classes:
    print(f" {cls.classification} — {cls.document_count} docs")
    if cls.json_schema:
        print(f" Fields: {list(cls.json_schema.get('properties', {}).keys())}")

print(result.reflection_report)
```
458+
459+
See [IDP SDK Reference — `discovery.run_multi_doc()`](idp-sdk.md) for all parameters.
460+
461+
#### Usage — idp_common Directly
462+
463+
```python
from idp_common.discovery.multi_document_discovery import MultiDocumentDiscovery

discovery = MultiDocumentDiscovery(
    region="us-east-1",
    config={
        "embedding_model_id": "us.cohere.embed-v4:0",
        "analysis_model_id": "us.anthropic.claude-sonnet-4-6",
        "min_cluster_size": 2,
    },
)

# Local pipeline
result = discovery.run_local_pipeline(
    document_dir="/path/to/documents/",
    config_version="v1",  # Optional: save to DynamoDB config
)

# Or S3 pipeline
result = discovery.run_full_pipeline(
    bucket="my-bucket",
    prefix="documents/batch-001/",
)
```
487+
488+
See [idp_common API Reference — MultiDocumentDiscovery](idpcommon-api-reference.md#multidocumentdiscovery--multi-document-collection-discovery) for full method-level documentation.
489+
490+
#### Output
491+
492+
The pipeline returns a `MultiDocDiscoveryResult` containing:
493+
494+
| Field | Description |
|-------|-------------|
| `discovered_classes` | List of discovered classes, each with `classification`, `json_schema`, `document_count`, `sample_doc_ids` |
| `reflection_report` | Markdown report analyzing all discovered classes |
| `total_documents` | Total documents processed |
| `num_clusters` | Number of clusters found |
| `num_failed_embeddings` | Documents that failed embedding |
| `num_successful_schemas` / `num_failed_schemas` | Schema generation success/failure counts |
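
As a rough illustration of consuming these fields, the sketch below flattens a result into a plain dictionary that can be logged or serialized. `summarize_result` is a hypothetical helper, and the `SimpleNamespace` objects are stand-in data for illustration only. A real run would pass the `MultiDocDiscoveryResult` returned by the pipeline, and only the attribute names listed in the table above are assumed.

```python
import json
from types import SimpleNamespace


def summarize_result(result):
    """Build a JSON-serializable summary from a result-like object."""
    return {
        "total_documents": result.total_documents,
        "num_clusters": result.num_clusters,
        "classes": [
            {
                "classification": cls.classification,
                "document_count": cls.document_count,
                # json_schema may be missing if schema generation failed
                "fields": sorted((cls.json_schema or {}).get("properties", {})),
            }
            for cls in result.discovered_classes
        ],
    }


# Stand-in data, not real pipeline output.
fake_result = SimpleNamespace(
    total_documents=4,
    num_clusters=2,
    discovered_classes=[
        SimpleNamespace(
            classification="Invoice",
            document_count=2,
            json_schema={"properties": {"total": {}, "invoice_number": {}}},
        ),
        SimpleNamespace(classification="W2", document_count=2, json_schema=None),
    ],
)

summary = summarize_result(fake_result)
print(json.dumps(summary, indent=2))
```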
502+
503+
#### Web UI
504+
505+
The Web UI includes a **Multi-Doc Discovery** tab in the Discovery panel. This tab allows you to:
506+
- Select a directory of documents or upload multiple files
507+
- Monitor pipeline progress with step-by-step status updates
508+
- View discovered classes and their generated schemas
509+
510+
> **Note:** The Web UI multi-doc discovery feature requires the IDP stack to be deployed with the multi-document discovery nested stack enabled.
511+
512+
#### Best For
513+
514+
- **Bulk onboarding**: You have a folder of hundreds of mixed documents and want to automatically discover all document types
515+
- **No prior knowledge**: You don't know what types of documents are in the collection
516+
- **Classification + Schema in one step**: Generates both the document class names and extraction schemas simultaneously
517+
- **Local development**: Run from your workstation with just Bedrock model access — no AWS deployment needed
518+
381519
### Choosing the Right Method
382520

383-
| Factor | Without Ground Truth | With Ground Truth |
384-
|--------|---------------------|-------------------|
385-
| **Use Case** | New document exploration | Configuration optimization |
386-
| **Accuracy** | Good for structure discovery | Higher accuracy for known patterns |
387-
| **Speed** | Fast, single-pass analysis | Optimized based on reference data |
388-
| **Consistency** | May vary between runs | Consistent with reference patterns |
389-
| **Setup Effort** | Minimal - just upload document | Requires ground truth preparation |
390-
| **Best For** | Unknown document types | Improving existing workflows |
521+
| Factor | Without Ground Truth | With Ground Truth | Multi-Section Package | Multi-Document Collection |
|--------|---------------------|-------------------|-----------------------|--------------------------|
| **Input** | Single document | Single document + ground truth | Single multi-page PDF | Collection of documents |
| **Use Case** | New document exploration | Configuration optimization | Mixed document packages | Bulk class discovery |
| **Accuracy** | Good for structure discovery | Higher for known patterns | Good per-section | Good for clustering |
| **Speed** | Fast, single-pass | Optimized with reference | Parallel per range | Minutes for 100+ docs |
| **Setup Effort** | Minimal | Requires ground truth | Define page ranges | Just point at a folder |
| **Output** | 1 class + schema | 1 class + schema | N classes + schemas | N classes + schemas |
| **Best For** | Unknown document types | Improving existing workflows | Known multi-doc packages | Unknown mixed collections |
| **Extra Deps** | None | None | None | `multi_document_discovery` pip extra |
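
The **Input** row of the table can be read as a simple decision rule. The sketch below encodes that reading. `choose_discovery_method` and its parameters are hypothetical and not part of any IDP package, and the order of the checks is one plausible interpretation rather than an official recommendation.

```python
def choose_discovery_method(num_documents, has_ground_truth=False, has_page_ranges=False):
    """Suggest a discovery method based on the kind of input available."""
    if num_documents < 1:
        raise ValueError("at least one document is required")
    if num_documents > 1:
        # A collection of documents: cluster first, then analyze each cluster.
        return "Multi-Document Collection"
    if has_page_ranges:
        # One multi-page PDF with known section boundaries.
        return "Multi-Section Package"
    if has_ground_truth:
        return "With Ground Truth"
    return "Without Ground Truth"


print(choose_discovery_method(250))
print(choose_discovery_method(1, has_page_ranges=True))
print(choose_discovery_method(1, has_ground_truth=True))
print(choose_discovery_method(1))
```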
391531

392532
## Configuration
393533
