pip install "idp_common[classification]"
pip install "idp_common[extraction]"
pip install "idp_common[evaluation]"

# Install multi-document discovery (includes scikit-learn, scipy, numpy, strands-agents)
pip install "idp_common[multi_document_discovery]"

# Install everything
pip install "idp_common[all]"
```
Integration with Amazon Bedrock Data Automation for end-to-end document processing.

### Discovery (`idp_common.discovery`)

Automatic document class and schema discovery using LLMs. Includes single-document discovery (`ClassesDiscovery`) and multi-document collection discovery (`MultiDocumentDiscovery`).

#### ClassesDiscovery — Single-Document Discovery

Analyzes a single document to identify its type and generate a JSON Schema.

```python
from idp_common.discovery.classes_discovery import ClassesDiscovery

# ...
result = discovery.discovery_classes_with_document(
    # ...
)
```

**Features:** JSON Schema generation, auto-detect section boundaries, page range selection, ground truth comparison.

#### MultiDocumentDiscovery — Multi-Document Collection Discovery

Discovers document classes from a collection of documents using an embedding-based clustering pipeline: **embed → cluster → analyze → reflect**. Supports both S3-based processing (for Lambda/Step Functions) and local file processing (for CLI/SDK).

> **Requires extra dependencies:** `pip install "idp_common[multi_document_discovery]"` or `make setup` from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.

> **Minimum 2 documents per class:** Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.

**Supported file types:** `.pdf`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, `.webp`

##### Initialization

```python
from idp_common.discovery.multi_document_discovery import MultiDocumentDiscovery

discovery = MultiDocumentDiscovery(
    region="us-east-1",
    config={
        "embedding_model_id": "us.cohere.embed-v4:0",           # Bedrock embedding model
        "analysis_model_id": "us.anthropic.claude-sonnet-4-6",  # Strands agent model
        "max_documents": 500,            # Safety limit
        "min_cluster_size": 2,           # Minimum docs per cluster
        "num_sample_documents": 3,       # Samples per cluster for analysis
        "max_concurrent_embeddings": 5,  # Parallel embedding calls
        "max_concurrent_clusters": 3,    # Parallel cluster analysis
        "max_sample_size": 5,            # Max images sent to agent
    },
)
```

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `region` | `str` | `"us-east-1"` | AWS region for Bedrock calls |
| `config` | `Dict` | `{}` | Discovery configuration (from `IDPConfig.discovery.multi_document`) |
| `bedrock_client` | `BedrockClient` | `None` | Optional pre-configured Bedrock client |

##### Internal Services

`MultiDocumentDiscovery` composes three specialized services:

| Service | Class | Purpose |
|---------|-------|---------|
| Embedding | `EmbeddingService` | Generates vector embeddings for document images via Bedrock |
| Clustering | `ClusteringService` | KMeans clustering with silhouette analysis (scikit-learn) |
| Analysis | `DiscoveryAgent` | Strands agent with Claude for cluster analysis and JSON Schema generation |

##### Local Pipeline (CLI/SDK)

The local pipeline processes documents from the local filesystem — no AWS infrastructure required beyond Bedrock model access.

**`run_local_pipeline()`** — Main entry point for local discovery

```python
result = discovery.run_local_pipeline(
    document_dir="/path/to/documents/",              # Scan directory recursively
    # document_paths=["/path/a.pdf", "/path/b.png"], # OR explicit file list
    config_version="v1",            # Optional: save results to DynamoDB config
    progress_callback=my_callback,  # Optional: callable(step_name, step_data)
)

print(f"Found {result.num_clusters} clusters from {result.total_documents} documents")
print(f"Successful schemas: {result.num_successful_schemas}")
print(result.reflection_report)

for cls in result.discovered_classes:
    print(f"{cls['classification']} — {cls['document_count']} docs")
    print(f"Schema keys: {list(cls['json_schema']['properties'].keys())}")
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_dir` | `str` | One of `document_dir` or `document_paths` | Directory to scan recursively |
| `document_paths` | `List[str]` | One of `document_dir` or `document_paths` | Explicit list of file paths |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |

**Pipeline steps:**

1. **List** — Scan directory or validate explicit paths
2. **Embed** — Render PDFs to images (pypdfium2), generate embeddings via Bedrock
3. **Cluster** — KMeans with automatic K selection via silhouette analysis
4. **Analyze** — Strands agent examines sample images per cluster, generates classification + JSON Schema
5. **Reflect** — Agent produces a Markdown report reviewing all discovered classes
6. **Save** — (Optional) Merge schemas into a DynamoDB configuration version
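
The automatic K selection in step 3 can be sketched with scikit-learn: fit KMeans for a range of candidate K values and keep the K with the highest silhouette score. This is an illustrative sketch of the technique, not the internals of `ClusteringService`; the synthetic vectors stand in for real Bedrock embedding output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(embeddings: np.ndarray, k_min: int = 2, k_max: int = 8):
    """Choose K by maximizing the silhouette score over candidate cluster counts."""
    best_k, best_score, best_labels = k_min, -1.0, None
    for k in range(k_min, min(k_max, len(embeddings) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embeddings)
        score = silhouette_score(embeddings, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Synthetic "embeddings": three tight groups of 8-dimensional vectors
rng = np.random.default_rng(0)
embeddings = np.vstack([rng.normal(loc=c, scale=0.05, size=(5, 8)) for c in (0.0, 1.0, 2.0)])
k, labels = pick_k(embeddings)
print(k)  # three well-separated groups, so the silhouette score peaks at k = 3
```

In the real pipeline the embeddings come from Bedrock, and clusters smaller than `min_cluster_size` are then dropped as noise.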

**`list_local_documents()`** — Scan for supported files

```python
paths = discovery.list_local_documents(
    document_dir="/path/to/documents/",  # Recursive scan
    max_documents=500,                   # Safety limit
)
# Returns: ["/abs/path/invoice1.pdf", "/abs/path/w2.png", ...]
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_dir` | `str` | One of dir/paths | Directory to scan recursively |
| `document_paths` | `List[str]` | One of dir/paths | Explicit file paths to validate |
| `max_documents` | `int` | No | Override safety limit (default: 500) |

**`generate_embeddings_local()`** — Generate embeddings from local files

```python
embedding_result = discovery.generate_embeddings_local(
    file_paths=paths,
    progress_callback=lambda done, total: print(f"{done}/{total}"),
)
# embedding_result.embeddings — numpy array (N × embedding_dim)
# embedding_result.valid_keys — file paths that succeeded
# embedding_result.failed_keys — file paths that failed
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_paths` | `List[str]` | Yes | Local file paths |
| `progress_callback` | `Callable[[int, int], None]` | No | Progress callback `(done, total)` |
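
The `(N × embedding_dim)` array can be inspected directly, for example to spot near-duplicate documents before clustering. A small numpy sketch, with synthetic vectors standing in for real embedding output:

```python
import numpy as np

def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity for an (N, dim) matrix of row vectors."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T

# Synthetic stand-ins: rows 0 and 1 point the same way; row 2 is orthogonal
vecs = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(vecs)
print(np.round(sim, 3))  # sim[0, 1] == 1.0, sim[0, 2] == 0.0
```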

##### S3 Pipeline (Lambda/Step Functions)

The S3 pipeline processes documents stored in Amazon S3. It is designed for Step Functions orchestration (Lambda handlers call individual steps) or for a single high-level call.

**`run_full_pipeline()`** — End-to-end S3 pipeline

```python
result = discovery.run_full_pipeline(
    bucket="my-bucket",
    prefix="documents/batch-001/",
    config_version="v1",            # Optional: save to DynamoDB config
    progress_callback=my_callback,  # Optional
)
```

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bucket` | `str` | Yes | S3 bucket containing documents |
| `prefix` | `str` | Yes | S3 key prefix to scan |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |

**Step-by-step methods** (for Step Functions Map state integration):

```python
# Step 1: List documents in S3
s3_keys = discovery.list_documents(bucket="my-bucket", prefix="docs/", max_documents=500)

# Step 2: Generate embeddings
embedding_result = discovery.generate_embeddings(
    bucket="my-bucket", s3_keys=s3_keys, progress_callback=cb
)

# Step 3: Cluster
cluster_result = discovery.cluster_documents(embedding_result)

# Step 4: Load images for analysis
images = discovery._load_images_for_analysis(bucket="my-bucket", s3_keys=embedding_result.valid_keys)

# Step 5: Analyze each cluster (suitable for Step Functions Map iteration)
discovered_classes = [
    discovery.analyze_cluster(cluster_id, cluster_result, images)
    for cluster_id in range(cluster_result.num_clusters)
]

# Step 6: Generate reflection report
report = discovery.reflect(discovered_classes)

# Step 7: Save to config (optional)
saved = discovery.save_to_config(discovered_classes, config_version="v1",
                                 input_bucket="my-bucket", input_prefix="docs/")
```

| Method | Description |
|--------|-------------|
| `list_documents(bucket, prefix, max_documents)` | List supported files in S3 |
| `generate_embeddings(bucket, s3_keys, progress_callback)` | Generate embeddings for S3 documents |
| `cluster_documents(embedding_result)` | Cluster documents based on embeddings |
| `analyze_cluster(cluster_id, cluster_result, images)` | Analyze a single cluster (returns `DiscoveredClass`) |
| `reflect(discovered_classes)` | Generate Markdown reflection report |
| `save_to_config(discovered_classes, config_version, input_bucket, input_prefix)` | Save to DynamoDB config |

##### Result Objects

**`MultiDocDiscoveryResult`** (dataclass)

| Field | Type | Description |
|-------|------|-------------|
| `discovered_classes` | `List[Dict]` | List of discovered classes as serializable dicts |
| `reflection_report` | `str` | Markdown reflection report |
| `total_documents` | `int` | Total documents processed |
| `num_clusters` | `int` | Number of clusters found |
| `num_failed_embeddings` | `int` | Documents that failed embedding |
| `num_successful_schemas` | `int` | Clusters with successful schema generation |
| `num_failed_schemas` | `int` | Clusters where schema generation failed |

Each entry in `discovered_classes` contains:

| Key | Type | Description |
|-----|------|-------------|
| `cluster_id` | `int` | Cluster identifier |
| `classification` | `str` | Discovered document type name |
| `json_schema` | `Dict` | Generated JSON Schema for extraction |
| `document_count` | `int` | Number of documents in the cluster |
| `sample_doc_ids` | `List[str]` | Sample document identifiers (file paths or S3 keys) |
| `error` | `str \| None` | Error message if analysis failed for this cluster |
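
Each `json_schema` is standard JSON Schema, so a discovered class can drive downstream validation of extracted data. The schema below is a hypothetical example of what the agent might produce (the real shape depends on your documents); validation here uses the third-party `jsonschema` package.

```python
from jsonschema import ValidationError, validate

# Hypothetical discovered-class entry (illustrative, not real agent output)
discovered = {
    "cluster_id": 0,
    "classification": "invoice",
    "document_count": 4,
    "json_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
        },
        "required": ["invoice_number"],
    },
}

extracted = {"invoice_number": "INV-001", "total": 99.5}
try:
    validate(instance=extracted, schema=discovered["json_schema"])
    ok = True
except ValidationError:
    ok = False
print("valid" if ok else "invalid")  # this instance satisfies the schema
```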

##### Progress Callback

Both `run_local_pipeline()` and `run_full_pipeline()` accept a `progress_callback(step_name, step_data)` that receives updates at each pipeline stage:

| Step Name | Data | Description |
|-----------|------|-------------|
| `listing_documents` | `{dir, paths}` or `{bucket, prefix}` | Starting document scan |
| `documents_found` | `{count}` | Number of documents found |
| `generating_embeddings` | `{total}` | Starting embedding generation |
| `embedding_progress` | `{done, total}` | Per-document embedding progress |
| `embeddings_complete` | Serialized `EmbeddingResult` | All embeddings done |
| `clustering` | `{num_documents}` | Starting clustering |
| `clustering_complete` | Serialized `ClusterResult` | Clustering done |
| `analyzing_clusters` | `{total}` | Starting cluster analysis |
| `cluster_analysis_progress` | `{done, total, classification}` | Per-cluster progress |
| `reflecting` | — | Starting reflection |
| `saving_to_config` | `{version}` | Saving to DynamoDB (if requested) |
| `pipeline_complete` | Full result dict | Pipeline finished |
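
A minimal callback only needs to accept `(step_name, step_data)`; the step names come from the table above, and everything else in this sketch is illustrative:

```python
events = []

def my_callback(step_name: str, step_data) -> None:
    """Record every update; print a counter for per-document embedding progress."""
    events.append((step_name, step_data))
    if step_name == "embedding_progress":
        print(f"embeddings: {step_data['done']}/{step_data['total']}")

# Simulated updates, using step names from the table above
my_callback("documents_found", {"count": 12})
my_callback("embedding_progress", {"done": 3, "total": 12})
```

Pass it as `progress_callback=my_callback` to either pipeline entry point.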

##### Integration with IDP SDK and CLI

The multi-document discovery pipeline is also accessible through higher-level interfaces:

```python
# Via IDP SDK
from idp_sdk import IDPClient

client = IDPClient()
result = client.discovery.run_multi_doc(
    document_dir="/path/to/documents/",
    progress_callback=my_callback,
)
```

```bash
# Via IDP CLI
idp-cli discover-multidoc --dir /path/to/documents/
```

See [IDP SDK Reference](idp-sdk.md) and [IDP CLI Reference](idp-cli.md) for full details.

### Schema (`idp_common.schema`)

Dynamic Pydantic v2 model generation from JSON Schema definitions.