Commit 58b761f

Author: Taniya Mathur
Commit message: resolving merge conflict
2 parents c48033f + 77533f0 commit 58b761f

35 files changed
Lines changed: 7001 additions & 525 deletions

.gitignore

Lines changed: 3 additions & 0 deletions

```diff
@@ -58,3 +58,6 @@ notebooks/doc-split-metrics-exp/
 samconfig.toml
 ..bfg-report
 .git-rewrite
+
+# Cline memory bank (session-specific, not project source)
+memory-bank/
```

ADAPTIVE_EXTRACTION_GUIDE.md (new file)

Lines changed: 275 additions & 0 deletions
# Adaptive Table Extraction - No minItems Required!

## The Problem You Had

Setting `minItems: 1440` only works for **that specific document**. What about:
- Documents with 50 rows?
- Documents with 500 rows?
- Documents with 5,000 rows?
- Documents with varying lengths?

You'd need different configs for each size, which defeats automation!

## The Solution: Automatic Adaptive Detection

The system **automatically detects table size from OCR** and adjusts agent instructions accordingly. **No minItems constraint needed!**

## How It Works

### 1. OCR Analysis (Happens Automatically)

```python
# System scans OCR text for Markdown table rows
table_rows_detected = count_lines_with_pipes(ocr_text)
# Example: 250 rows detected
```
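The row counter used above is internal to the system and not shown in the source; a minimal sketch of the idea (the `min_pipes` heuristic and separator-row handling are assumptions, not the shipped implementation) could look like:

```python
def count_lines_with_pipes(ocr_text: str, min_pipes: int = 2) -> int:
    """Heuristic: count OCR lines that look like Markdown table rows."""
    rows = 0
    for line in ocr_text.splitlines():
        stripped = line.strip()
        # A Markdown table row has cell delimiters, e.g. "| AAPL | 100 | 175.32 |"
        if stripped.count("|") >= min_pipes:
            # Skip separator rows like "|---|---|"
            if set(stripped) <= set("|-: "):
                continue
            rows += 1
    return rows

ocr_text = "| Symbol | Qty |\n|---|---|\n| AAPL | 100 |\n| MSFT | 50 |"
print(count_lines_with_pipes(ocr_text))  # → 3 (header + 2 data rows)
```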
### 2. Recommendation Strength (Calculated Automatically)

| Detected Rows | Recommendation | Agent Instructions |
|---------------|----------------|--------------------|
| 0-49 | OPTIONAL | Standard guidance (tool available) |
| 50-99 | RECOMMENDED | Gentle recommendation to use tool |
| 100-499 | **STRONGLY_RECOMMENDED** | **Explicit "YOU MUST use tool"** |
| 500+ | **MANDATORY** | **Critical "IMMEDIATELY use tool"** |
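The mapping in the table can be sketched as a simple threshold function (a hypothetical helper for illustration; the real mapping lives in the service code):

```python
def recommendation_strength(detected_rows: int) -> str:
    """Map OCR-detected row count to guidance strength (thresholds from the table above)."""
    if detected_rows >= 500:
        return "MANDATORY"
    if detected_rows >= 100:
        return "STRONGLY_RECOMMENDED"
    if detected_rows >= 50:
        return "RECOMMENDED"
    return "OPTIONAL"

print(recommendation_strength(1440))  # → MANDATORY
print(recommendation_strength(250))   # → STRONGLY_RECOMMENDED
print(recommendation_strength(15))    # → OPTIONAL
```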
### 3. Agent Instructions (Injected Automatically)

#### For 500+ rows (your 1440-row case):
```
**CRITICAL - MANDATORY TABLE PARSING TOOL USAGE**:
This document contains a large table with 1440+ rows detected by OCR analysis.
You MUST use the parse_table tool for complete and accurate extraction:

1. IMMEDIATELY call parse_table with the full document text
2. DO NOT attempt manual row-by-row LLM extraction
3. Verify parse_table returned ALL expected rows
...
```

#### For 100-499 rows:
```
**IMPORTANT - USE TABLE PARSING TOOL**:
This document contains tabular data with 250+ rows detected.
You MUST use the parse_table tool for accurate and complete extraction:
...
```

#### For 50-99 rows:
```
**RECOMMENDED - TABLE PARSING TOOL**:
Detected a table with 75+ rows. Consider using the parse_table tool:
...
```
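Instruction injection can be pictured as template selection keyed on recommendation strength. The template texts below are abridged from the examples above; the `build_table_instructions` helper itself is hypothetical:

```python
TEMPLATES = {
    "MANDATORY": (
        "**CRITICAL - MANDATORY TABLE PARSING TOOL USAGE**:\n"
        "This document contains a large table with {rows}+ rows detected by OCR analysis.\n"
        "You MUST use the parse_table tool for complete and accurate extraction."
    ),
    "STRONGLY_RECOMMENDED": (
        "**IMPORTANT - USE TABLE PARSING TOOL**:\n"
        "This document contains tabular data with {rows}+ rows detected.\n"
        "You MUST use the parse_table tool for accurate and complete extraction:"
    ),
    "RECOMMENDED": (
        "**RECOMMENDED - TABLE PARSING TOOL**:\n"
        "Detected a table with {rows}+ rows. Consider using the parse_table tool:"
    ),
}

def build_table_instructions(strength: str, rows: int) -> str:
    # OPTIONAL (< 50 rows) injects nothing beyond the standard guidance
    template = TEMPLATES.get(strength)
    return template.format(rows=rows) if template else ""

print(build_table_instructions("MANDATORY", 1440).splitlines()[0])
# → **CRITICAL - MANDATORY TABLE PARSING TOOL USAGE**:
```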
## Configuration (Universal - Works for ALL Sizes)

```yaml
# ONE configuration for all document sizes!
extraction:
  agentic:
    enabled: true
    table_parsing:
      enabled: true                     # ← Just enable it
      max_empty_line_gap: 3             # ← Handles page breaks
      auto_merge_adjacent_tables: true  # ← Merges fragments

classes:
  - properties:
      table_data:
        type: array
        # NO minItems! System adapts automatically
        description: "Table data - adapts to any size"
        items:
          type: object
          properties:
            # your columns here
```
That's it! Works for:
- ✅ 10-row invoice
- ✅ 50-row bank statement
- ✅ 250-row transaction log
- ✅ 1,440-row brokerage statement
- ✅ 10,000-row inventory list

**Same config, same code, automatic adaptation!**
## What About minItems?

### ❌ Don't Use minItems as a Trigger
```yaml
# BAD - Only works for this specific document size
holdings:
  type: array
  minItems: 1440  # ← What if next document has 500 rows?
```

### ✅ Use minItems for Business Constraints (Optional)
```yaml
# GOOD - Express actual business requirement
holdings:
  type: array
  minItems: 1  # ← "Must have at least 1 holding"
  description: "Portfolio holdings - any quantity"
```

### ✅ Or Don't Use minItems at All (Recommended)
```yaml
# BEST - Let OCR analysis handle detection automatically
holdings:
  type: array
  # No minItems - system adapts to actual document size
  description: "Portfolio holdings"
```
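The business-constraint flavor of `minItems` pairs with the post-extraction completeness validation the system performs. A pure-Python sketch of that check (the `check_completeness` helper is illustrative, not the service's actual API):

```python
def check_completeness(extracted: list, schema_property: dict) -> str:
    """Compare an extracted array against the schema's optional minItems constraint."""
    min_items = schema_property.get("minItems")
    if min_items is not None and len(extracted) < min_items:
        return f"shortfall: got {len(extracted)} items, schema requires at least {min_items}"
    return "ok"

print(check_completeness([{"symbol": "AAPL"}], {"type": "array", "minItems": 1}))  # → ok
print(check_completeness([], {"type": "array", "minItems": 1}))
# → shortfall: got 0 items, schema requires at least 1
```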
The system will:
1. Detect 1,440 rows from OCR → Triggers MANDATORY
2. Inject explicit instructions → Agent uses tool
3. Extract all 1,440 rows → Success!

**And it works the same way for any other document size!**
## Real-World Examples

### Example 1: Invoice with 15 Line Items
```
OCR Analysis: Detected 15 table rows
Recommendation: OPTIONAL
Agent Behavior: Standard extraction (tool available but not required)
Result: 15 items extracted correctly
```

### Example 2: Bank Statement with 75 Transactions
```
OCR Analysis: Detected 75 table rows
Recommendation: RECOMMENDED
Agent Instructions: "**RECOMMENDED - TABLE PARSING TOOL**"
Agent Behavior: Uses parse_table tool (recommended)
Result: 75 transactions extracted correctly with tool
```

### Example 3: Transaction Log with 250 Entries
```
OCR Analysis: Detected 250 table rows
Recommendation: STRONGLY_RECOMMENDED
Agent Instructions: "**IMPORTANT - USE TABLE PARSING TOOL**"
Agent Behavior: MUST use parse_table tool (explicit instructions)
Result: 250 entries extracted completely
```

### Example 4: Brokerage Statement with 1,440 Holdings
```
OCR Analysis: Detected 1,440 table rows
Recommendation: MANDATORY
Agent Instructions: "**CRITICAL - MANDATORY TABLE PARSING TOOL USAGE**"
Agent Behavior: MUST use parse_table tool (critical requirement)
Result: 1,440 holdings extracted completely
```

### Example 5: Inventory with 8,500 SKUs (50 pages)
```
OCR Analysis: Detected 8,500 table rows
Recommendation: MANDATORY
Agent Instructions: "**CRITICAL - MANDATORY...**" + page break handling
Agent Behavior: Uses tool, auto-merges 50-page table fragments
Result: 8,500 SKUs extracted across all pages
```

**All handled by the SAME configuration!**
## Observability - What You See

### Processing Report Shows Detection
```
=== EXTRACTION PROCESSING REPORT ===

Extraction Method: AGENTIC
Processing Time: 54.8 seconds
Status: SUCCESS

OCR Table Detection:
- Tables detected: 1
- Estimated total rows: 1440            ← Automatically detected!
- Tool usage recommendation: MANDATORY  ← Automatically calculated!

✓ Table Parsing Tool Decision:
- Expected usage: YES  ← Based on 1440 rows
- Actual usage: YES    ← Agent followed instructions
- Explanation: Tool was recommended and used as expected

✓ Completeness Validation:
- All schema constraints satisfied

✓ Table Parsing Tool Results:
- Tables parsed: 1
- Total rows extracted: 1440  ← Success!
```
## Tuning (Optional)

### Adjust OCR Thresholds (If Needed)

Default thresholds work well for most cases:
```
# Current automatic thresholds (in code):
50-99 rows   → RECOMMENDED
100-499 rows → STRONGLY_RECOMMENDED
500+ rows    → MANDATORY
```

These thresholds are optimized for real-world use:
- **50 rows**: Table is large enough to benefit from deterministic parsing
- **100 rows**: Agent explicitly told to use tool for completeness
- **500 rows**: Critical requirement - manual extraction would be slow/incomplete

You can change these cut-offs in code if your workload demands it, but the **default thresholds are tested and recommended** for production.

### Adjust Robustness Settings

```yaml
# For high-quality OCR (clean documents):
max_empty_line_gap: 2

# For standard quality (recommended default):
max_empty_line_gap: 3

# For noisy OCR (complex/scanned documents):
max_empty_line_gap: 5
```
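To see what `max_empty_line_gap` controls, here is a toy version of the gap-tolerant table grouping (an illustrative sketch of the idea, not the shipped parser):

```python
def split_tables(lines: list[str], max_empty_line_gap: int = 3) -> list[list[str]]:
    """Group pipe-delimited lines into tables, tolerating short runs of empty
    lines (OCR artifacts) inside a table. A gap longer than the tolerance
    ends the current table."""
    tables, current, gap = [], [], 0
    for line in lines:
        if "|" in line:
            current.append(line)   # table row: reset the gap counter
            gap = 0
        elif not line.strip() and current:
            gap += 1               # empty line inside a table
            if gap > max_empty_line_gap:
                tables.append(current)  # gap too large: close the table
                current, gap = [], 0
        elif current:
            tables.append(current)      # non-table text ends the table
            current, gap = [], 0
    if current:
        tables.append(current)
    return tables

# A single empty line falls within the default tolerance, so this stays one table:
tables = split_tables(["| A | B |", "", "| 1 | 2 |"])
```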
## Benefits vs Old Approach

### ❌ Old Way (minItems-based):
- Need specific minItems for each document type
- Must know document size in advance
- One config per document size range
- Fails when document size varies
- Manual configuration burden

### ✅ New Way (OCR-based adaptive):
- **One config for all document sizes**
- **No advance knowledge needed**
- **Automatically scales to document**
- **Handles size variation seamlessly**
- **Zero manual tuning**
## Summary

1. **Don't set minItems for triggering** - let OCR analysis handle it
2. **Enable table_parsing** - system adapts automatically
3. **Use same config for all documents** - 10 rows to 10,000+ rows
4. **Check processing report** - see what was detected and why

The system is now **truly adaptive** and **truly robust** for any table size!

## Quick Start

Use the provided adaptive configuration:
```bash
cp config_adaptive_table_extraction.yaml your_config.yaml
# Edit class properties as needed
# Deploy and test - works for any document size!
```

That's it! Your extraction now automatically adapts to any document, any table size, any number of pages.

CHANGELOG.md

Lines changed: 12 additions & 0 deletions

```diff
@@ -20,8 +20,20 @@ SPDX-License-Identifier: MIT-0
 - **Two Input Modes**: S3 path (select bucket + prefix), zip upload (presigned URL), or local directory (CLI/SDK)
 - **Configuration Integration**: Discovered classes are saved directly to the selected config version's `classes` array in DynamoDB, immediately available for document processing without manual schema creation
+- **Agentic Extraction Hardening** — Improved robustness, observability, and table parsing for agentic extraction:
+  - Pre-flight OCR & schema analysis with adaptive guidance strength (RECOMMENDED → STRONGLY_RECOMMENDED → MANDATORY) ensures the table parsing tool is used for large tables
+  - Deterministic Markdown table parser with lookahead recovery, auto-merge of split tables, and configurable `max_empty_line_gap`
+  - Post-extraction completeness validation against schema constraints with detailed shortfall reporting
+  - Processing report with tool usage decisions, completeness checks, and root cause diagnostics (new UI tab + CloudWatch logs)
+  - Thread-safe state management via `contextvars.ContextVar`; deprecated review agent (config fields preserved as no-ops)
+  - Bug fixes: `patch_buffer_data` slice correction, confidence assessment loop fix, row-based parse success metric, NoneType guard in completeness check
 - **Chandra OCR Lambda Hook Sample** — New `GENAIIDP-chandra-ocr-hook` sample in `samples/lambda-hook-inference/` that integrates [Datalab Chandra OCR 2](https://github.com/datalab-to/chandra) with the LambdaHook feature for high-quality OCR. Supports 90+ languages, math, tables, forms, and handwriting. Uses the Datalab hosted async API (`/api/v1/convert`) with configurable output format (markdown/json/html) and conversion mode (fast/balanced/accurate). Includes standalone SAM template, local test script, and deployment instructions. See `docs/lambda-hook-inference.md` — Chandra OCR Integration section.
+- **Wildcard pattern support for delete-documents** — `idp-cli delete-documents` and `client.batch.delete_documents()` now accept a `--pattern` / `pattern` parameter for fnmatch-style wildcard matching (e.g. `"batch-123/*.pdf"`, `"*invoice*"`). Combines with `--status-filter` to delete, for example, all failed invoices across batches.
+- **Prompt Preview** — New "Prompt Preview" tab in the Configuration page lets you preview the actual prompts sent to the LLM for each processing step (Classification, Extraction, Assessment, Summarization). Config-derived placeholders are filled in with real values (class names, cleaned JSON Schema), while document-specific placeholders are shown as highlighted markers. Includes token estimates, copy-to-clipboard, and a substitution details panel showing the exact schema sent to the LLM. Helps optimize document class schemas and prompt templates.
+- **IDP CLI `chat` Command & SDK `ChatOperation`** — Interactive Agent Companion Chat from the terminal and programmatic SDK access. Runs the same multi-agent orchestrator as the Web UI locally, with real-time streaming and multi-turn conversation support. Includes Analytics Agent, Error Analyzer Agent, and optionally Code Intelligence Agent (`--enable-code-intelligence`). Available as `idp-cli chat --stack-name <stack>` for interactive use, `--prompt` flag for single-shot scripting, and `client.chat.send_message()` in the Python SDK. See `docs/idp-cli.md#chat`.
 
 ### Fixed
```

CLAUDE.md

Lines changed: 49 additions & 3 deletions

````diff
@@ -169,11 +169,50 @@ The unified architecture supports two processing modes, controlled by the `use_b
 2. **Pipeline Mode** (formerly Pattern 2)
    - OCR with Amazon Textract
    - Classification with Bedrock (page-level or holistic)
-   - Extraction with Bedrock
+   - Extraction with Bedrock (traditional or agentic)
    - Supports few-shot examples
+   - Optional agentic extraction with deterministic table parsing
 
 > **Note**: The separate `patterns/pattern-1/`, `patterns/pattern-2/`, and `patterns/pattern-3/` directories have been removed. All processing is now in `patterns/unified/`. See [pattern-1.md](docs/pattern-1.md) and [pattern-2.md](docs/pattern-2.md) for historical reference.
 
+### Agentic Extraction with Table Parsing
+
+The extraction service supports an optional **agentic extraction mode** with intelligent table parsing:
+
+**When to Use**:
+- Documents with large tables (100+ rows) where completeness is critical
+- Bank statements, transaction logs, brokerage statements
+- Multi-page tables that may split across OCR page breaks
+- Documents where OCR artifacts (empty lines, missing characters) cause data loss
+
+**Key Features**:
+- **Intelligent Lookahead Recovery**: Tolerates OCR artifacts (empty lines, missing pipes) by looking ahead to detect table continuation
+- **Auto-Merge Table Fragments**: Automatically merges tables with identical columns that were split by page breaks
+- **Smart Warnings**: Agent receives actionable warnings (⚠️ fragmentation, ℹ️ recovery) to verify completeness
+- **Hybrid Extraction**: Agent uses deterministic parsing for well-structured tables, falls back to LLM for complex layouts
+- **Completeness Validation**: Service validates extracted data against schema constraints (e.g., `minItems`)
+
+**Configuration**:
+```yaml
+extraction:
+  model: "us.anthropic.claude-sonnet-4-20250514-v1:0"
+  agentic:
+    enabled: true
+    table_parsing:
+      enabled: true                     # Enable deterministic table parser tool
+      max_empty_line_gap: 3             # Tolerate up to 3 empty lines in tables (0-10)
+      auto_merge_adjacent_tables: true  # Merge table fragments
+      min_confidence_threshold: 95.0    # OCR confidence target (Textract only)
+      min_parse_success_rate: 0.90      # Quality threshold for parsed results
+```
+
+**Tuning**:
+- **High-quality OCR**: `max_empty_line_gap: 2`
+- **Standard quality**: `max_empty_line_gap: 3` (default)
+- **Complex/noisy documents**: `max_empty_line_gap: 5-7`
+
+See `lib/idp_common_pkg/idp_common/extraction/README.md` for detailed documentation.
+
 ### Document Processing Flow
 
 1. Documents uploaded to Input S3 bucket trigger EventBridge events
@@ -194,12 +233,19 @@ The unified architecture supports two processing modes, controlled by the `use_b
 - `pip install "idp_common[core]"` - minimal dependencies
 - `pip install "idp_common[ocr]"` - OCR support
 - `pip install "idp_common[classification]"` - Classification support
-- `pip install "idp_common[extraction]"` - Extraction support
+- `pip install "idp_common[extraction]"` - Extraction support (includes optional agentic mode with deterministic table parsing tool)
 - `pip install "idp_common[evaluation]"` - Evaluation support
 - `pip install "idp_common[all]"` - everything
-- Components: OCR, Classification, Extraction, Evaluation, Summarization, AppSync integration, Reporting, BDA integration
+- Components: OCR, Classification, Extraction (supports traditional and agentic modes with intelligent table parsing), Evaluation, Summarization, AppSync integration, Reporting, BDA integration
 - Configuration management via DynamoDB
 - Document models and data structures
+- Extraction features:
+  - Traditional LLM-based extraction with few-shot examples
+  - Agentic extraction with tool-based structured output (Strands framework)
+  - Deterministic Markdown table parser for robust tabular data extraction
+  - Intelligent recovery from OCR artifacts (empty lines, missing pipes)
+  - Automatic merging of table fragments split by page breaks
+  - Hybrid extraction: agent uses parsing for tables, LLM for complex layouts
 
 **`idp_cli`** (`idp_cli/`):
 - Command-line interface for deployment and batch processing
````
