Commit e0dd022

Merge remote-tracking branch 'origin/develop' into feature/error-analyzer
2 parents 02d0bb7 + 279ba39 commit e0dd022

7 files changed

Lines changed: 353 additions & 41 deletions

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -7,8 +7,14 @@ SPDX-License-Identifier: MIT-0

 ### Added

+- **Wildcard pattern support for delete-documents** `idp-cli delete-documents` and `client.batch.delete_documents()` now accept a `--pattern` / `pattern` parameter for fnmatch-style wildcard matching (e.g. `"batch-123/*.pdf"`, `"*invoice*"`). Combines with `--status-filter` to delete e.g. all failed invoices across batches.
+
 - **Chandra OCR Lambda Hook Sample** — New `GENAIIDP-chandra-ocr-hook` sample in `samples/lambda-hook-inference/` that integrates [Datalab Chandra OCR 2](https://github.com/datalab-to/chandra) with the LambdaHook feature for high-quality OCR. Supports 90+ languages, math, tables, forms, and handwriting. Uses the Datalab hosted async API (`/api/v1/convert`) with configurable output format (markdown/json/html) and conversion mode (fast/balanced/accurate). Includes standalone SAM template, local test script, and deployment instructions. See `docs/lambda-hook-inference.md` — Chandra OCR Integration section.

+### Fixed
+
+- **`delete-documents` fails with DynamoDB errors** — Fixed two bugs in `get_documents_by_batch()`: (1) passing empty `ExpressionAttributeNames={}` when no status filter caused `ValidationException`, and (2) using low-level DynamoDB client type descriptors (`{"S": "..."}`) with the high-level Table resource caused `begins_with` operand type mismatch. Rewrote to use the high-level `Table.scan()` API with `boto3.dynamodb.conditions.Attr`.
+
 ## [0.5.4]

 ### Added
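
A note on bug (1) in the Fixed entry: the old code always sent `ExpressionAttributeNames`, passing `{}` when no status filter was given. The sketch below shows one way to avoid that by adding the optional key only when it is needed. `build_scan_kwargs` is a hypothetical helper for illustration and is not part of this commit, which switched to `Attr` conditions instead.

```python
from typing import Any, Dict, Optional


def build_scan_kwargs(status_filter: Optional[str] = None) -> Dict[str, Any]:
    """Build scan arguments, including optional keys only when they are used.

    Hypothetical helper: values are plain Python strings, as expected by the
    high-level Table resource, not {"S": "..."} type descriptors.
    """
    kwargs: Dict[str, Any] = {
        "FilterExpression": "begins_with(PK, :pk_prefix)",
        "ExpressionAttributeValues": {":pk_prefix": "doc#"},
    }
    if status_filter:
        kwargs["FilterExpression"] += " AND #status = :status"
        kwargs["ExpressionAttributeValues"][":status"] = status_filter
        # Added only here, so an empty mapping is never sent.
        kwargs["ExpressionAttributeNames"] = {"#status": "Status"}
    return kwargs


no_filter = build_scan_kwargs()
with_filter = build_scan_kwargs("FAILED")
```

Without a filter the returned dictionary has no `ExpressionAttributeNames` key at all, which is the condition the changelog says avoids the `ValidationException`.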

docs/idp-cli.md

Lines changed: 19 additions & 1 deletion
@@ -932,10 +932,11 @@ idp-cli delete-documents [OPTIONS]
 **Document Selection (choose ONE):**
 - `--document-ids`: Comma-separated list of document IDs (S3 object keys) to delete
 - `--batch-id`: Delete all documents in this batch
+- `--pattern`: Wildcard pattern to match document keys (e.g. `"batch-123/*.pdf"`, `"*invoice*"`)

 **Options:**
 - `--stack-name` (required): CloudFormation stack name
-- `--status-filter`: Only delete documents with this status (use with --batch-id)
+- `--status-filter`: Only delete documents with this status (use with --batch-id or --pattern)
   - Options: `FAILED`, `COMPLETED`, `PROCESSING`, `QUEUED`
 - `--dry-run`: Show what would be deleted without actually deleting
 - `--force`, `-y`: Skip confirmation prompt
@@ -972,6 +973,23 @@ idp-cli delete-documents \
   --batch-id cli-batch-20250123 \
   --dry-run

+# Delete documents matching a wildcard pattern
+idp-cli delete-documents \
+  --stack-name my-stack \
+  --pattern "batch-123/*.pdf"
+
+# Delete all failed invoice documents across batches
+idp-cli delete-documents \
+  --stack-name my-stack \
+  --pattern "*invoice*" \
+  --status-filter FAILED
+
+# Dry run with pattern to preview matches
+idp-cli delete-documents \
+  --stack-name my-stack \
+  --pattern "*2024*" \
+  --dry-run
+
 # Force delete without confirmation
 idp-cli delete-documents \
   --stack-name my-stack \
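
Because `--pattern` uses fnmatch-style matching rather than path-aware globbing, `*` also matches `/`, so `"batch-123/*.pdf"` reaches into nested prefixes. The sketch below illustrates this with made-up keys. `match_keys` is a hypothetical helper, and it uses `fnmatch.fnmatchcase` so the result does not depend on the operating system, whereas the commit calls `fnmatch.fnmatch`.

```python
import fnmatch
from typing import List

# Made-up object keys, for illustration only.
KEYS = [
    "batch-123/invoice-001.pdf",
    "batch-123/sub/receipt.pdf",
    "batch-456/invoice-002.pdf",
    "batch-456/notes.txt",
]


def match_keys(keys: List[str], pattern: str) -> List[str]:
    """Return the keys matching an fnmatch-style pattern, case-sensitively."""
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


in_batch = match_keys(KEYS, "batch-123/*.pdf")
invoices = match_keys(KEYS, "*invoice*")
```

Here `in_batch` includes `batch-123/sub/receipt.pdf` as well as the top-level PDF, which is worth checking with `--dry-run` before deleting.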

docs/idp-sdk.md

Lines changed: 16 additions & 2 deletions
@@ -547,15 +547,18 @@ print(f"Downloaded {result.files_downloaded} source files")

 ### batch.delete_documents()

-Permanently delete all documents in a batch and their associated data from InputBucket, OutputBucket, and DynamoDB.
+Permanently delete documents and their associated data from InputBucket, OutputBucket, and DynamoDB. Select documents by batch ID or wildcard pattern.

 **Parameters:**
-- `batch_id` (str, required): Batch identifier
+- `batch_id` (str, optional): Batch identifier (selects all docs containing this string)
+- `pattern` (str, optional): Wildcard pattern to match document keys (e.g., `"batch-123/*.pdf"`, `"*invoice*"`)
 - `status_filter` (str, optional): Filter by document status (e.g., "FAILED", "COMPLETED")
 - `stack_name` (str, optional): Stack name override
 - `dry_run` (bool, optional): If True, simulate deletion without actually deleting (default: False)
 - `continue_on_error` (bool, optional): Continue deleting if one document fails (default: True)

+**Note:** Must specify either `batch_id` or `pattern` (not both).
+
 **Returns:** `BatchDeletionResult` with `success`, `deleted_count`, `failed_count`, `total_count`, `dry_run`, and `results` (list of DocumentDeletionResult)

 ```python
@@ -568,6 +571,17 @@ result = client.batch.delete_documents(
     status_filter="FAILED"
 )

+# Delete by wildcard pattern
+result = client.batch.delete_documents(
+    pattern="batch-123/*.pdf"
+)
+
+# Delete all failed invoices across batches
+result = client.batch.delete_documents(
+    pattern="*invoice*",
+    status_filter="FAILED"
+)
+
 # Dry run
 result = client.batch.delete_documents(
     batch_id="batch-123",
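
The "either `batch_id` or `pattern` (not both)" rule can be sketched as a small check. `resolve_selector` is a hypothetical helper written for illustration; the SDK's actual validation is not shown in this diff and may differ.

```python
from typing import Optional


def resolve_selector(
    batch_id: Optional[str] = None, pattern: Optional[str] = None
) -> str:
    """Return the name of the selector in use, or raise if zero or both are set.

    Hypothetical helper. Truthiness is used on purpose, so an empty string
    counts as "not given".
    """
    candidates = (("batch_id", batch_id), ("pattern", pattern))
    given = [name for name, value in candidates if value]
    if len(given) != 1:
        raise ValueError("Specify exactly one of batch_id or pattern")
    return given[0]


selected = resolve_selector(pattern="*invoice*")
```

Counting truthy values generalizes to any number of mutually exclusive selectors, at the cost of treating `""` the same as `None`.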

lib/idp_cli_pkg/idp_cli/cli.py

Lines changed: 44 additions & 7 deletions
@@ -1092,10 +1092,14 @@ def delete(
     "--batch-id",
     help="Delete all documents in this batch (alternative to --document-ids)",
 )
+@click.option(
+    "--pattern",
+    help='Wildcard pattern to match document keys (e.g. "batch-123/*.pdf", "*invoice*")',
+)
 @click.option(
     "--status-filter",
     type=click.Choice(["FAILED", "COMPLETED", "PROCESSING", "QUEUED"]),
-    help="Only delete documents with this status (use with --batch-id)",
+    help="Only delete documents with this status (use with --batch-id or --pattern)",
 )
 @click.option(
     "--dry-run",
@@ -1113,6 +1117,7 @@ def delete_documents_cmd(
     stack_name: str,
     document_ids: Optional[str],
     batch_id: Optional[str],
+    pattern: Optional[str],
     status_filter: Optional[str],
     dry_run: bool,
     force: bool,
@@ -1141,6 +1146,12 @@ def delete_documents_cmd(
       # Delete only failed documents in a batch
       idp-cli delete-documents --stack-name my-stack --batch-id cli-batch-20250123 --status-filter FAILED

+      # Delete documents matching a wildcard pattern
+      idp-cli delete-documents --stack-name my-stack --pattern "batch-123/*.pdf"
+
+      # Delete all failed invoice documents
+      idp-cli delete-documents --stack-name my-stack --pattern "*invoice*" --status-filter FAILED
+
       # Dry run to see what would be deleted
       idp-cli delete-documents --stack-name my-stack --batch-id cli-batch-20250123 --dry-run

@@ -1149,19 +1160,24 @@ def delete_documents_cmd(
     """
     try:
         import boto3
-        from idp_common.delete_documents import delete_documents, get_documents_by_batch
+        from idp_common.delete_documents import (
+            delete_documents,
+            get_documents_by_batch,
+            get_documents_by_pattern,
+        )
         from idp_sdk import IDPClient

-        # Validate input
-        if not document_ids and not batch_id:
+        # Validate input - exactly one of document_ids, batch_id, or pattern required
+        selector_count = sum(1 for x in [document_ids, batch_id, pattern] if x)
+        if selector_count == 0:
             console.print(
-                "[red]✗ Error: Must specify either --document-ids or --batch-id[/red]"
+                "[red]✗ Error: Must specify one of --document-ids, --batch-id, or --pattern[/red]"
             )
             sys.exit(1)

-        if document_ids and batch_id:
+        if selector_count > 1:
             console.print(
-                "[red]✗ Error: Cannot specify both --document-ids and --batch-id[/red]"
+                "[red]✗ Error: Cannot specify more than one of --document-ids, --batch-id, --pattern[/red]"
             )
             sys.exit(1)

@@ -1190,6 +1206,27 @@ def delete_documents_cmd(
         if document_ids:
             doc_list = [d.strip() for d in document_ids.split(",")]
             console.print(f"Selected {len(doc_list)} document(s) for deletion")
+        elif pattern:
+            console.print(
+                f"[bold blue]Finding documents matching pattern: {pattern}[/bold blue]"
+            )
+            doc_list = get_documents_by_pattern(
+                tracking_table=tracking_table,
+                pattern=pattern,
+                status_filter=status_filter,
+            )
+            if not doc_list:
+                console.print(
+                    f"[yellow]No documents found matching pattern: {pattern}[/yellow]"
+                )
+                if status_filter:
+                    console.print(
+                        f"[yellow] (with status filter: {status_filter})[/yellow]"
+                    )
+                sys.exit(0)
+            console.print(f"Found {len(doc_list)} document(s) matching pattern")
+            if status_filter:
+                console.print(f" (filtered by status: {status_filter})")
         else:
             console.print(
                 f"[bold blue]Getting documents for batch: {batch_id}[/bold blue]"
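
The `--document-ids` branch above splits on commas and strips whitespace, which keeps an empty string when the input has a trailing or doubled comma. A minimal sketch of a stricter parser follows; `parse_document_ids` is a hypothetical helper and not code from this commit.

```python
from typing import List


def parse_document_ids(raw: str) -> List[str]:
    """Split a comma-separated ID string, trimming whitespace and dropping empties."""
    return [part.strip() for part in raw.split(",") if part.strip()]


raw_ids = "batch-1/a.pdf, batch-1/b.pdf,"

# Same expression as the diff: the trailing comma yields an empty entry.
as_in_diff = [d.strip() for d in raw_ids.split(",")]

# Stricter variant: empty entries are discarded.
cleaned = parse_document_ids(raw_ids)
```

With the stricter variant the "Selected N document(s)" count would reflect only real keys, and no empty key would be passed on to the deletion step.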

lib/idp_common_pkg/idp_common/delete_documents.py

Lines changed: 73 additions & 22 deletions
@@ -10,6 +10,7 @@
 - List entry cleanup with timestamp-aware shard handling
 """

+import fnmatch
 import logging
 from typing import Any, Dict, List, Optional, Tuple

@@ -464,6 +465,38 @@ def delete_documents(
     }


+def _scan_all_document_keys(
+    tracking_table, status_filter: Optional[str] = None
+) -> List[Dict[str, Any]]:
+    """
+    Scan all document items from the tracking table.
+
+    Args:
+        tracking_table: DynamoDB table resource
+        status_filter: Optional status filter ('COMPLETED', 'FAILED', 'PROCESSING', etc.)
+
+    Returns:
+        List of DynamoDB items with PK starting with 'doc#'
+    """
+    from boto3.dynamodb.conditions import Attr
+
+    items: List[Dict[str, Any]] = []
+    filter_expr = Attr("PK").begins_with("doc#")
+    if status_filter:
+        filter_expr = filter_expr & Attr("Status").eq(status_filter)
+
+    scan_kwargs: Dict[str, Any] = {"FilterExpression": filter_expr}
+
+    while True:
+        response = tracking_table.scan(**scan_kwargs)
+        items.extend(response.get("Items", []))
+        if "LastEvaluatedKey" not in response:
+            break
+        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
+
+    return items
+
+
 def get_documents_by_batch(
     tracking_table, batch_id: str, status_filter: Optional[str] = None
 ) -> List[str]:
@@ -481,29 +514,47 @@ def get_documents_by_batch(
     object_keys = []

     try:
-        # Query the GSI for batch documents (if available) or scan with filter
-        # For now, we'll use a prefix-based query on the tracking table
-        paginator = tracking_table.meta.client.get_paginator("scan")
-
-        filter_expression = "begins_with(PK, :pk_prefix)"
-        expression_values = {":pk_prefix": {"S": "doc#"}}
-
-        if status_filter:
-            filter_expression += " AND #status = :status"
-            expression_values[":status"] = {"S": status_filter}
-
-        for page in paginator.paginate(
-            TableName=tracking_table.table_name,
-            FilterExpression=filter_expression,
-            ExpressionAttributeValues=expression_values,
-            ExpressionAttributeNames={"#status": "Status"} if status_filter else {},
-        ):
-            for item in page.get("Items", []):
-                object_key = item.get("ObjectKey", {}).get("S", "")
-                if batch_id in object_key:
-                    object_keys.append(object_key)
-
+        items = _scan_all_document_keys(tracking_table, status_filter)
+        for item in items:
+            object_key = item.get("ObjectKey", "")
+            if batch_id in object_key:
+                object_keys.append(object_key)
     except Exception as e:
         logger.error(f"Error getting documents for batch {batch_id}: {str(e)}")

     return object_keys
+
+
+def get_documents_by_pattern(
+    tracking_table, pattern: str, status_filter: Optional[str] = None
+) -> List[str]:
+    """
+    Get all document object keys matching a wildcard pattern.
+
+    Uses fnmatch-style patterns (*, ?, [seq], [!seq]).
+
+    Examples:
+        - ``"batch-123/*"`` — all docs in batch-123
+        - ``"*/invoice*.pdf"`` — any invoice PDF in any batch
+        - ``"*2025*"`` — any doc with 2025 in the key
+
+    Args:
+        tracking_table: DynamoDB table resource
+        pattern: Wildcard pattern to match against object keys
+        status_filter: Optional status filter ('COMPLETED', 'FAILED', 'PROCESSING', etc.)
+
+    Returns:
+        List of matching object keys
+    """
+    object_keys = []
+
+    try:
+        items = _scan_all_document_keys(tracking_table, status_filter)
+        for item in items:
+            object_key = item.get("ObjectKey", "")
+            if fnmatch.fnmatch(object_key, pattern):
+                object_keys.append(object_key)
+    except Exception as e:
+        logger.error(f"Error getting documents for pattern {pattern}: {str(e)}")
+
+    return object_keys
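
The pagination loop in `_scan_all_document_keys` can be exercised without AWS. The sketch below assumes a hypothetical `FakeTable` stub that mimics only the `Items` and `LastEvaluatedKey` fields of a scan response, and `scan_all` mirrors the loop from the diff. It also shows why the loop keys off `LastEvaluatedKey` rather than an empty page: a filtered scan can return a page with no items while more pages remain.

```python
from typing import Any, Dict, List


class FakeTable:
    """Hypothetical stand-in for a DynamoDB Table resource; serves fixed pages."""

    def __init__(self, pages: List[List[Dict[str, Any]]]) -> None:
        self._pages = pages
        self.calls = 0

    def scan(self, **kwargs: Any) -> Dict[str, Any]:
        # The "page" key is an invented cursor, not a real DynamoDB key shape.
        index = kwargs.get("ExclusiveStartKey", {"page": 0})["page"]
        self.calls += 1
        response: Dict[str, Any] = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            response["LastEvaluatedKey"] = {"page": index + 1}
        return response


def scan_all(table: Any, **scan_kwargs: Any) -> List[Dict[str, Any]]:
    """Collect items from every page, following LastEvaluatedKey until it is absent."""
    items: List[Dict[str, Any]] = []
    while True:
        response = table.scan(**scan_kwargs)
        items.extend(response.get("Items", []))
        if "LastEvaluatedKey" not in response:
            break
        scan_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return items


# The middle page is empty, as can happen when a filter rejects every item on it.
table = FakeTable([[{"ObjectKey": "batch-1/a.pdf"}], [], [{"ObjectKey": "batch-2/b.pdf"}]])
all_items = scan_all(table)
```

Stopping at the first empty page would have missed `batch-2/b.pdf` in this example; following the cursor makes three calls and returns both items.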
