- [Choosing the Right Method](#choosing-the-right-method)
- [Configuration](#configuration)
- [Model Configuration](#model-configuration)
@@ -84,6 +85,7 @@ This analysis produces structured configuration templates that can be used to co
- **⚡ Real-Time Processing**: Provides immediate feedback through the web interface
- **📊 PDF Page Thumbnails**: Visual page preview with color-coded range highlighting in the browser
- **🔗 BDA Blueprint Automation**: Automatic BDA blueprint creation when running in BDA mode
- **📦 Multi-Document Collection Discovery**: Discover document classes from a collection of documents using embedding-based clustering (local or S3)

### Use Cases
@@ -378,16 +380,154 @@ When a `page_range` is specified for a PDF, the system uses `pypdfium2` to extra
- Each page range job is independent and can run in parallel
- The original document is never modified
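
As a rough illustration of the first point, the sketch below fans independent page-range jobs out to a thread pool. The `"start-end"` range format, the `parse_page_range` helper, and the `discover_range` function are assumptions made for this example and are not part of the documented API; a real job would call the discovery service instead of returning a summary dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_page_range(page_range: str) -> list[int]:
    """Convert a 1-based, inclusive "start-end" string into 0-based page indices."""
    start_text, end_text = page_range.split("-")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {page_range!r}")
    return list(range(start - 1, end))


def discover_range(page_range: str) -> dict:
    """Hypothetical stand-in for one discovery job; it only reports the pages it would read."""
    return {"page_range": page_range, "page_indices": parse_page_range(page_range)}


def run_jobs(page_ranges: list[str], max_workers: int = 4) -> list[dict]:
    """Run one job per page range in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(discover_range, page_ranges))


if __name__ == "__main__":
    for job in run_jobs(["1-2", "3-5", "6-6"]):
        print(job["page_range"], job["page_indices"])
```

Because no job shares state with another and none writes to the source file, the jobs can be scheduled in any order; `pool.map` is used here only so that results come back in the order the ranges were submitted.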

### Multi-Document Collection Discovery

While the other discovery methods analyze a single document at a time, **Multi-Document Collection Discovery** discovers document classes from a *collection* of documents using embedding-based clustering. Given a folder of mixed documents (e.g., invoices, W-2s, bank statements), it automatically groups similar documents together and generates a JSON Schema and classification for each group.

> **Requires extra dependencies:** `pip install "idp_common[multi_document_discovery]"` or `make setup` from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.

> **Minimum 2 documents per class:** Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.

#### How It Works

```mermaid
graph LR
    A[Document Collection] --> B[Embed]
    B --> C[Cluster]
    C --> D[Analyze]
    D --> E[Reflect]
    E --> F[Discovered Classes + Schemas]
```

1. **Embed**: Each document's first page is rendered to an image and embedded as a vector using Amazon Bedrock (`us.cohere.embed-v4:0` by default)
2. **Cluster**: Embeddings are clustered using KMeans with automatic K selection via silhouette analysis (scikit-learn). Clusters with fewer than `min_cluster_size` (default: 2) documents are filtered as noise.
3. **Analyze**: For each cluster, a Strands agent with Claude (`us.anthropic.claude-sonnet-4-6`) examines sample document images and generates a classification name + JSON Schema definition
4. **Reflect**: The agent produces a Markdown reflection report reviewing all discovered classes, their relationships, and potential overlaps
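
The Cluster step can be sketched with scikit-learn roughly as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the function name `cluster_embeddings`, the `max_k` search bound, the use of `-1` as a noise label, and the synthetic 2-D vectors standing in for Bedrock embeddings are all invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_embeddings(embeddings, max_k=10, min_cluster_size=2, seed=0):
    """Pick K by silhouette score, then mark undersized clusters as noise.

    Returns one label per row; rows in filtered clusters get the label -1
    (an assumption of this sketch).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    n_docs = len(embeddings)
    # silhouette_score is only defined for 2 <= k <= n_docs - 1
    upper_k = min(max_k, n_docs - 1)
    if upper_k < 2:
        raise ValueError("need at least 3 documents to compare cluster counts")

    best_score, best_labels = float("-inf"), None
    for k in range(2, upper_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_score, best_labels = score, labels

    # Relabel clusters smaller than min_cluster_size as noise
    labels = best_labels.copy()
    cluster_ids, counts = np.unique(labels, return_counts=True)
    for cluster_id, count in zip(cluster_ids, counts):
        if count < min_cluster_size:
            labels[labels == cluster_id] = -1
    return labels


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
    # Five synthetic "documents" around each of three centers
    points = np.vstack([c + rng.normal(scale=0.2, size=(5, 2)) for c in centers])
    print(cluster_embeddings(points))
```

The silhouette score rewards tight, well-separated clusters, so scanning K and keeping the best score avoids asking the user for the number of document types up front. The noise filter is what makes the two-documents-per-class minimum matter: a document type represented by a single file can end up alone in its cluster and be dropped.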
```python
# ...
    config_version="v1",  # Optional: save to DynamoDB config
)

# Or S3 pipeline
result = discovery.run_full_pipeline(
    bucket="my-bucket",
    prefix="documents/batch-001/",
)
```

See [idp_common API Reference: MultiDocumentDiscovery](idpcommon-api-reference.md#multidocumentdiscovery--multi-document-collection-discovery) for full method-level documentation.

#### Output

The pipeline returns a `MultiDocDiscoveryResult` containing:

| Field | Description |
|-------|-------------|
| `discovered_classes` | List of discovered classes, each with `classification`, `json_schema`, `document_count`, `sample_doc_ids` |
| `reflection_report` | Markdown report analyzing all discovered classes |
| `total_documents` | Total documents processed |
| `num_clusters` | Number of clusters found |
| `num_failed_embeddings` | Documents that failed embedding |
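
To show how these fields might be consumed, the sketch below summarizes a result object. The `ExampleClass` and `ExampleResult` dataclasses are hypothetical stand-ins that only mirror the field names in the table; whether the real `MultiDocDiscoveryResult` exposes its classes as objects or dictionaries is not stated here, so attribute access is an assumption, as is the presence of a `properties` key in each schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleClass:
    """Hypothetical stand-in mirroring the per-class fields in the table."""
    classification: str
    json_schema: dict
    document_count: int
    sample_doc_ids: list = field(default_factory=list)


@dataclass
class ExampleResult:
    """Hypothetical stand-in mirroring the top-level fields in the table."""
    discovered_classes: list
    reflection_report: str
    total_documents: int
    num_clusters: int
    num_failed_embeddings: int


def summarize(result: ExampleResult) -> list[str]:
    """Build one overview line plus one line per discovered class."""
    lines = [
        f"{result.total_documents} documents, {result.num_clusters} clusters, "
        f"{result.num_failed_embeddings} failed embeddings"
    ]
    for cls in result.discovered_classes:
        # Assumes an object-type JSON Schema with a "properties" mapping
        properties = cls.json_schema.get("properties", {})
        lines.append(
            f"{cls.classification}: {cls.document_count} documents, "
            f"{len(properties)} schema properties"
        )
    return lines


if __name__ == "__main__":
    demo = ExampleResult(
        discovered_classes=[
            ExampleClass("Invoice", {"properties": {"total": {}, "date": {}}}, 5, ["doc-1"]),
            ExampleClass("W2", {"properties": {"wages": {}}}, 4, ["doc-7"]),
        ],
        reflection_report="# Reflection\n",
        total_documents=10,
        num_clusters=2,
        num_failed_embeddings=1,
    )
    print("\n".join(summarize(demo)))
```

A non-zero `num_failed_embeddings` is worth checking before trusting the class list, since documents that failed embedding never reached the clustering step.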