
Commit 3aedbce

feat(docs): add multi-document discovery API reference
1 parent 321011b commit 3aedbce

1 file changed

Lines changed: 251 additions & 1 deletion

File tree

docs/idpcommon-api-reference.md

@@ -18,6 +18,9 @@ pip install "idp_common[classification]"

```bash
pip install "idp_common[extraction]"
pip install "idp_common[evaluation]"

# Install multi-document discovery (includes scikit-learn, scipy, numpy, strands-agents)
pip install "idp_common[multi_document_discovery]"

# Install everything
pip install "idp_common[all]"
```
@@ -222,7 +225,11 @@ Integration with Amazon Bedrock Data Automation for end-to-end document processing

### Discovery (`idp_common.discovery`)

Automatic document class and schema discovery using LLMs. Includes single-document discovery (`ClassesDiscovery`) and multi-document collection discovery (`MultiDocumentDiscovery`).

#### ClassesDiscovery — Single-Document Discovery

Analyzes a single document to identify its type and generate a JSON Schema.

```python
from idp_common.discovery.classes_discovery import ClassesDiscovery
```
@@ -235,6 +242,249 @@ result = discovery.discovery_classes_with_document(

**Features:** JSON Schema generation, auto-detect section boundaries, page range selection, ground truth comparison.

#### MultiDocumentDiscovery — Multi-Document Collection Discovery

Discovers document classes from a collection of documents using an embedding-based clustering pipeline: **embed → cluster → analyze → reflect**. Supports both S3-based processing (for Lambda/Step Functions) and local file processing (for CLI/SDK).

> **Requires extra dependencies:** `pip install "idp_common[multi_document_discovery]"` or `make setup` from the project root. This installs scikit-learn, scipy, numpy, strands-agents, and pypdfium2.

> **Minimum 2 documents per class:** Clusters with fewer than 2 documents are filtered as noise. Ensure you provide at least 2 documents for each expected document type.

**Supported file types:** `.pdf`, `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif`, `.webp`
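The noise rule can be sketched as a simple post-clustering filter (a hypothetical helper shown for illustration; the library applies this filtering internally):

```python
from collections import Counter

def drop_small_clusters(labels, min_cluster_size=2):
    """Relabel clusters smaller than min_cluster_size as noise (-1)."""
    counts = Counter(labels)
    keep = {c for c, n in counts.items() if n >= min_cluster_size}
    return [label if label in keep else -1 for label in labels]

# Cluster 1 has a single document, so it is treated as noise.
print(drop_small_clusters([0, 0, 1, 2, 2, 2]))  # [0, 0, -1, 2, 2, 2]
```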
##### Initialization

```python
from idp_common.discovery.multi_document_discovery import MultiDocumentDiscovery

discovery = MultiDocumentDiscovery(
    region="us-east-1",
    config={
        "embedding_model_id": "us.cohere.embed-v4:0",           # Bedrock embedding model
        "analysis_model_id": "us.anthropic.claude-sonnet-4-6",  # Strands agent model
        "max_documents": 500,            # Safety limit
        "min_cluster_size": 2,           # Minimum docs per cluster
        "num_sample_documents": 3,       # Samples per cluster for analysis
        "max_concurrent_embeddings": 5,  # Parallel embedding calls
        "max_concurrent_clusters": 3,    # Parallel cluster analysis
        "max_sample_size": 5,            # Max images sent to agent
    },
)
```
| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `region` | `str` | `"us-east-1"` | AWS region for Bedrock calls |
| `config` | `Dict` | `{}` | Discovery configuration (from `IDPConfig.discovery.multi_document`) |
| `bedrock_client` | `BedrockClient` | `None` | Optional pre-configured Bedrock client |
##### Internal Services

`MultiDocumentDiscovery` composes three specialized services:

| Service | Class | Purpose |
|---------|-------|---------|
| Embedding | `EmbeddingService` | Generates vector embeddings for document images via Bedrock |
| Clustering | `ClusteringService` | KMeans clustering with silhouette analysis (scikit-learn) |
| Analysis | `DiscoveryAgent` | Strands agent with Claude for cluster analysis and JSON Schema generation |
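The automatic K selection performed during clustering can be sketched with scikit-learn directly. The helper below (`pick_k_by_silhouette` is a hypothetical name, not the library's API) fits KMeans for each candidate K and keeps the K with the best silhouette score:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k_by_silhouette(vectors, k_min=2, k_max=6):
    """Fit KMeans for each candidate K; return the K with the best silhouette."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(k_min, min(k_max, len(vectors) - 1) + 1):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)
        score = silhouette_score(vectors, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    return best_k, best_labels

# Three well-separated toy "embeddings": silhouette analysis picks K=3.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
best_k, labels = pick_k_by_silhouette(points)
print(best_k)  # 3
```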
##### Local Pipeline (CLI/SDK)

The local pipeline processes documents from the local filesystem — no AWS infrastructure required beyond Bedrock model access.

**`run_local_pipeline()`** — Main entry point for local discovery
```python
result = discovery.run_local_pipeline(
    document_dir="/path/to/documents/",              # Scan directory recursively
    # document_paths=["/path/a.pdf", "/path/b.png"], # OR explicit file list
    config_version="v1",            # Optional: save results to DynamoDB config
    progress_callback=my_callback,  # Optional: callable(step_name, step_data)
)

print(f"Found {result.num_clusters} clusters from {result.total_documents} documents")
print(f"Successful schemas: {result.num_successful_schemas}")
print(result.reflection_report)

for cls in result.discovered_classes:
    print(f"  {cls['classification']}: {cls['document_count']} docs")
    print(f"  Schema keys: {list(cls['json_schema']['properties'].keys())}")
```
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_dir` | `str` | One of `document_dir` or `document_paths` | Directory to scan recursively |
| `document_paths` | `List[str]` | One of `document_dir` or `document_paths` | Explicit list of file paths |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |
**Pipeline steps:**

1. **List** — Scan directory or validate explicit paths
2. **Embed** — Render PDFs to images (pypdfium2), generate embeddings via Bedrock
3. **Cluster** — KMeans with automatic K selection via silhouette analysis
4. **Analyze** — Strands agent examines sample images per cluster, generates classification + JSON Schema
5. **Reflect** — Agent produces a Markdown report reviewing all discovered classes
6. **Save** — (Optional) Merge schemas into a DynamoDB configuration version
**`list_local_documents()`** — Scan for supported files

```python
paths = discovery.list_local_documents(
    document_dir="/path/to/documents/",  # Recursive scan
    max_documents=500,                   # Safety limit
)
# Returns: ["/abs/path/invoice1.pdf", "/abs/path/w2.png", ...]
```
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `document_dir` | `str` | One of dir/paths | Directory to scan recursively |
| `document_paths` | `List[str]` | One of dir/paths | Explicit file paths to validate |
| `max_documents` | `int` | No | Override safety limit (default: 500) |
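Given the supported file types listed above, the recursive scan can be sketched as follows (a hypothetical standalone helper, not the library's implementation):

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".tif", ".webp"}

def scan_documents(document_dir, max_documents=500):
    """Recursively collect supported files, enforcing the safety limit."""
    paths = sorted(
        str(p.resolve())
        for p in Path(document_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
    if len(paths) > max_documents:
        raise ValueError(f"Found {len(paths)} documents, limit is {max_documents}")
    return paths
```

Note the case-insensitive extension match, so `invoice.PDF` and `invoice.pdf` are both picked up.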
**`generate_embeddings_local()`** — Generate embeddings from local files

```python
embedding_result = discovery.generate_embeddings_local(
    file_paths=paths,
    progress_callback=lambda done, total: print(f"{done}/{total}"),
)
# embedding_result.embeddings  — numpy array (N × embedding_dim)
# embedding_result.valid_keys  — file paths that succeeded
# embedding_result.failed_keys — file paths that failed
```
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `file_paths` | `List[str]` | Yes | Local file paths |
| `progress_callback` | `Callable[[int, int], None]` | No | Progress callback `(done, total)` |
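The N × dim embedding matrix is what downstream clustering consumes. As a minimal illustration of how embedding vectors are compared geometrically, here is plain cosine similarity between two vectors (illustrative only; this is not necessarily the distance metric the clustering service uses):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0.
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```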
##### S3 Pipeline (Lambda/Step Functions)

The S3 pipeline processes documents stored in Amazon S3. Designed for use with Step Functions orchestration (Lambda handlers call individual steps) or as a single high-level call.

**`run_full_pipeline()`** — End-to-end S3 pipeline

```python
result = discovery.run_full_pipeline(
    bucket="my-bucket",
    prefix="documents/batch-001/",
    config_version="v1",            # Optional: save to DynamoDB config
    progress_callback=my_callback,  # Optional
)
```
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| `bucket` | `str` | Yes | S3 bucket containing documents |
| `prefix` | `str` | Yes | S3 key prefix to scan |
| `config_version` | `str` | No | Config version to save discovered classes to |
| `progress_callback` | `Callable[[str, Any], None]` | No | Progress updates callback |
**Step-by-step methods** (for Step Functions Map state integration):

```python
# Step 1: List documents in S3
s3_keys = discovery.list_documents(bucket="my-bucket", prefix="docs/", max_documents=500)

# Step 2: Generate embeddings
embedding_result = discovery.generate_embeddings(
    bucket="my-bucket", s3_keys=s3_keys, progress_callback=cb
)

# Step 3: Cluster
cluster_result = discovery.cluster_documents(embedding_result)

# Step 4: Load images for analysis
images = discovery._load_images_for_analysis(bucket="my-bucket", s3_keys=embedding_result.valid_keys)

# Step 5: Analyze each cluster (suitable for Step Functions Map iteration)
discovered_classes = [
    discovery.analyze_cluster(cluster_id, cluster_result, images)
    for cluster_id in range(cluster_result.num_clusters)
]

# Step 6: Generate reflection report
report = discovery.reflect(discovered_classes)

# Step 7: Save to config (optional)
saved = discovery.save_to_config(discovered_classes, config_version="v1",
                                 input_bucket="my-bucket", input_prefix="docs/")
```
| Method | Description |
|--------|-------------|
| `list_documents(bucket, prefix, max_documents)` | List supported files in S3 |
| `generate_embeddings(bucket, s3_keys, progress_callback)` | Generate embeddings for S3 documents |
| `cluster_documents(embedding_result)` | Cluster documents based on embeddings |
| `analyze_cluster(cluster_id, cluster_result, images)` | Analyze a single cluster (returns `DiscoveredClass`) |
| `reflect(discovered_classes)` | Generate Markdown reflection report |
| `save_to_config(discovered_classes, config_version, input_bucket, input_prefix)` | Save to DynamoDB config |
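When distributing cluster analysis across a Step Functions Map state, each cluster becomes an independent Map item. A minimal sketch of building the Map input, where the helper name and payload shape are assumptions rather than the library's contract:

```python
def build_cluster_map_items(num_clusters, cluster_result_key, images_key):
    """One Map item per cluster; downstream Lambdas load shared state from S3."""
    return [
        {
            "cluster_id": cluster_id,
            "cluster_result_s3_key": cluster_result_key,
            "images_s3_key": images_key,
        }
        for cluster_id in range(num_clusters)
    ]

items = build_cluster_map_items(3, "state/clusters.json", "state/images.json")
print(len(items))  # 3
```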
##### Result Objects

**`MultiDocDiscoveryResult`** (dataclass)

| Field | Type | Description |
|-------|------|-------------|
| `discovered_classes` | `List[Dict]` | List of discovered classes as serializable dicts |
| `reflection_report` | `str` | Markdown reflection report |
| `total_documents` | `int` | Total documents processed |
| `num_clusters` | `int` | Number of clusters found |
| `num_failed_embeddings` | `int` | Documents that failed embedding |
| `num_successful_schemas` | `int` | Clusters with successful schema generation |
| `num_failed_schemas` | `int` | Clusters where schema generation failed |
Each entry in `discovered_classes` contains:

| Key | Type | Description |
|-----|------|-------------|
| `cluster_id` | `int` | Cluster identifier |
| `classification` | `str` | Discovered document type name |
| `json_schema` | `Dict` | Generated JSON Schema for extraction |
| `document_count` | `int` | Number of documents in the cluster |
| `sample_doc_ids` | `List[str]` | Sample document identifiers (file paths or S3 keys) |
| `error` | `str \| None` | Error message if analysis failed for this cluster |
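Assuming the dict shape above, a consumer can partition successful classes from failed clusters via the `error` key (the sample data is illustrative):

```python
def split_results(discovered_classes):
    """Partition discovered classes into successes and failed clusters."""
    ok = [c for c in discovered_classes if c.get("error") is None]
    failed = [c for c in discovered_classes if c.get("error") is not None]
    return ok, failed

sample = [
    {"cluster_id": 0, "classification": "invoice",
     "json_schema": {"properties": {"total": {"type": "string"}}},
     "document_count": 4, "sample_doc_ids": ["a.pdf"], "error": None},
    {"cluster_id": 1, "classification": "", "json_schema": {},
     "document_count": 2, "sample_doc_ids": ["b.pdf"],
     "error": "Schema generation failed"},
]
ok, failed = split_results(sample)
print(len(ok), len(failed))  # 1 1
```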
##### Progress Callback

Both `run_local_pipeline()` and `run_full_pipeline()` accept a `progress_callback(step_name, step_data)` that receives updates at each pipeline stage:

| Step Name | Data | Description |
|-----------|------|-------------|
| `listing_documents` | `{dir, paths}` or `{bucket, prefix}` | Starting document scan |
| `documents_found` | `{count}` | Number of documents found |
| `generating_embeddings` | `{total}` | Starting embedding generation |
| `embedding_progress` | `{done, total}` | Per-document embedding progress |
| `embeddings_complete` | Serialized `EmbeddingResult` | All embeddings done |
| `clustering` | `{num_documents}` | Starting clustering |
| `clustering_complete` | Serialized `ClusterResult` | Clustering done |
| `analyzing_clusters` | `{total}` | Starting cluster analysis |
| `cluster_analysis_progress` | `{done, total, classification}` | Per-cluster progress |
| `reflecting` | (none) | Starting reflection |
| `saving_to_config` | `{version}` | Saving to DynamoDB (if requested) |
| `pipeline_complete` | Full result dict | Pipeline finished |
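A minimal callback matching the `(step_name, step_data)` signature, here recording events and printing per-document embedding progress (a sketch, not a required shape):

```python
events = []

def my_callback(step_name, step_data):
    """Record pipeline progress events; print per-document embedding progress."""
    events.append((step_name, step_data))
    if step_name == "embedding_progress":
        print(f"  embedding {step_data['done']}/{step_data['total']}")

# The pipeline would invoke the callback like this at each stage:
my_callback("documents_found", {"count": 12})
my_callback("embedding_progress", {"done": 3, "total": 12})
```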
##### Integration with IDP SDK and CLI

The multi-document discovery pipeline is also accessible through higher-level interfaces:

```python
# Via IDP SDK
from idp_sdk import IDPClient

client = IDPClient()
result = client.discovery.run_multi_doc(
    document_dir="/path/to/documents/",
    progress_callback=my_callback,
)
```

```bash
# Via IDP CLI
idp-cli discover-multidoc --dir /path/to/documents/
```

See [IDP SDK Reference](idp-sdk.md) and [IDP CLI Reference](idp-cli.md) for full details.
### Schema (`idp_common.schema`)

Dynamic Pydantic v2 model generation from JSON Schema definitions.
